Perl Text Comparison/Similarity Final

June 23, 2005 Kevin R Keegan

I have received a couple of suggestions by email about how best to compare two text strings. In my most recent post I discussed the String::Compare module. Since then I have found some additional modules that do similar things.

I tested and reviewed each of the following modules, plus a few additional concepts: String::Similarity, Text::Levenshtein, String::Approx, Text::Similarity, Algorithm::HowSimilar, and the compressed file concept.

It is important to remember that I am trying to compare music album titles and artists. These are short strings and may contain non-alphanumeric characters.

As I discussed in my last post, String::Compare was a great concept and certainly a good starting point, but it has a few flaws. The major flaw I found was that if string1 contained "&" and string2 contained "and", the match score came out low, generally below the level I would accept as a match automatically.

String::Similarity

This is my suggestion for short string comparison. It is based on the Levenshtein method. Stated simply, the Levenshtein method counts how many changes are required to make string1 identical to string2. So "Foo" and "Fou" require one change, and "Foo" and "Bar" require three. This module returns a value between 0 and 1, with 1 meaning identical.
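
The edit-counting idea can be sketched in a few lines. The following is an illustration in Python of the general Levenshtein method, not the module's own code. The function names `levenshtein` and `similarity` are made up for the example, and scaling the distance by the longer string's length is an assumption, so the scores may not match what String::Similarity reports.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn a into b (classic dynamic-programming version)."""
    # previous[j] is the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1 means identical.
    Dividing by the longer length is an assumption made for this sketch."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("Foo", "Fou"))  # 1
print(levenshtein("Foo", "Bar"))  # 3
print(similarity("Foo", "Foo"))   # 1.0
```

Returning a ratio rather than a raw count is what makes scores comparable across titles of different lengths.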

Text::Levenshtein

This uses the same method I mentioned above, but this module returns the distance: the number of edits required to make the strings identical. That is not as useful for me, since I want to compare things on a percentage basis and this returns the raw number of changes as an integer.

String::Approx

This is a more complex comparison tool. It is designed for approximate matching of strings and lets you set specific parameters, such as the tolerance for different types of differences. It is more complex than what I need.
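
To give a feel for what "tolerance for different types of differences" can mean, here is a rough Python sketch of the general idea: each type of edit gets its own cost, and two strings count as a match when the total cost stays within a tolerance. This does not reproduce String::Approx's options or API. The function names, the cost parameters, and the 20% tolerance are all assumptions made for the example.

```python
def weighted_distance(a, b, insert_cost=1, delete_cost=1, substitute_cost=1):
    """Edit distance from a to b where each type of edit has its own cost."""
    previous = [j * insert_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * delete_cost]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + delete_cost,       # drop a character of a
                current[j - 1] + insert_cost,    # add a character of b
                previous[j - 1] + (0 if ca == cb else substitute_cost),
            ))
        previous = current
    return previous[-1]


def within_tolerance(a, b, tolerance=0.20, **costs):
    """True when the weighted distance is at most tolerance * len(a).
    The 20% default is a hypothetical value chosen for this sketch."""
    return weighted_distance(a, b, **costs) <= tolerance * len(a)


print(weighted_distance("Foo", "Fou"))                     # 1
print(weighted_distance("Foo", "Fou", substitute_cost=3))  # 2 (delete + insert is cheaper)
print(within_tolerance("Abbey Road", "Abey Road"))         # True
print(within_tolerance("Abbey Road", "Let It Be"))         # False
```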

Text::Similarity

This is almost identical to String::Similarity, except that it is designed to take two complete files and compare them. While very cool, this is not what I am looking for; I just need to compare two short strings.

Algorithm::HowSimilar

This module uses the Unix diff algorithm to calculate the difference between two items. The major drawback is that if you have strings of different lengths, the results start getting a little fuzzy.
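
The length problem is easy to see with any diff-style score. The sketch below uses Python's `difflib.SequenceMatcher`, which is a different matcher from the diff the Perl module builds on, so treat it only as an illustration of the effect. The function name `how_similar` and the choice to report the matched share of each string plus their average are assumptions for the example.

```python
from difflib import SequenceMatcher


def how_similar(a: str, b: str):
    """Return (average, share_of_a, share_of_b), where each share is the
    fraction of that string covered by matching blocks."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    share_a = matched / len(a) if a else 0.0
    share_b = matched / len(b) if b else 0.0
    return (share_a + share_b) / 2, share_a, share_b


# All of the short title is found in the long one, but the match covers
# only 10 of the long title's 23 characters.
average, share_a, share_b = how_similar("Abbey Road", "Abbey Road (Remastered)")
print(average, share_a, share_b)
```

When one string is much longer than the other, the two shares pull in opposite directions, and a single combined number hides that.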

Compressed file concept

This is a concept discussed on Perl Monks and, per the author, was discussed in Scientific American at one point. The concept is that the compressed versions of the two text files should have similar sizes, and an analysis of the resulting compressed files should reveal their similarity. It is a very interesting concept that I am sure works well; however, as someone commented, it would not work very well for album titles because they are so short.
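
One common way to turn the compression idea into a number is the normalized compression distance, sketched here in Python with zlib. Whether this is exactly what the Perl Monks thread had in mind is an assumption; the function names and sample strings are made up for the example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def compression_distance(a: str, b: str) -> float:
    """Normalized compression distance: lower means more similar.
    If b adds little new information to a, compressing them together
    costs barely more than compressing one of them alone."""
    xa, xb = a.encode("utf-8"), b.encode("utf-8")
    size_a, size_b = compressed_size(xa), compressed_size(xb)
    size_both = compressed_size(xa + xb)
    return (size_both - min(size_a, size_b)) / max(size_a, size_b)


# Long inputs give the compressor something to work with.
long_a = "the quick brown fox jumps over the lazy dog " * 50
long_b = "pack my box with five dozen liquor jugs " * 50
print(compression_distance(long_a, long_a))  # small
print(compression_distance(long_a, long_b))  # noticeably larger

# Album titles are only a few bytes, so fixed overhead dominates the sizes.
print(compressed_size(b"Abbey Road"), compressed_size(b"Abbey Rd"))
```

For a ten-character title the compressed output is longer than the input, which is why the approach has little to measure on strings this short.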

Conclusion

So far String::Similarity has worked well for me. I have found that anything above 95% is always correct, 85% and above is probably correct, 80% and above is a toss-up, and anything below 80% is almost always wrong.
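
Those cutoffs translate directly into a small bucketing function. This is a Python sketch that assumes the score is a value between 0 and 1; the function name and the bucket labels are hypothetical.

```python
def classify(score: float) -> str:
    """Map a 0..1 similarity score onto the buckets described above."""
    if score > 0.95:
        return "match"           # always correct in my experience
    if score >= 0.85:
        return "probable match"  # probably correct
    if score >= 0.80:
        return "toss-up"         # needs a manual look
    return "no match"            # almost always wrong


for score in (0.97, 0.90, 0.82, 0.50):
    print(score, classify(score))
```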

There are many other modules out there that also compute similarity; I have just not gotten around to testing them yet.

Cool Module

One last module that I found while browsing is WordNet::Similarity. It uses a dictionary to determine how related two words are. While a straight text comparison would tell you that "dog" and "cat" are not even close to similar, WordNet would recognize that they are both animals. Very cool idea.
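
The difference between comparing spelling and comparing meaning can be shown with a toy example. The Python sketch below uses a tiny hand-written taxonomy in place of WordNet and a simple path-based score; the taxonomy, the function names, and the scoring formula are all assumptions for illustration and are not taken from the module.

```python
# Hand-written toy taxonomy: each word points at its more general parent.
PARENT = {
    "dog": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",
    "carnivore": "animal",
    "oak": "tree", "tree": "plant",
    "animal": "organism", "plant": "organism",
}


def ancestors(word):
    """The word itself, followed by each more general term up to the root."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_similarity(a, b):
    """1 / (number of nodes on the shortest path from a to b), or 0.0
    when the two words share no ancestor in the toy taxonomy."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            steps_b = chain_b.index(node)
            return 1.0 / (steps_a + steps_b + 1)
    return 0.0


print(path_similarity("dog", "cat"))  # 0.2   (dog, canine, carnivore, feline, cat)
print(path_similarity("dog", "oak"))  # 0.125 (they only meet at "organism")
```

"dog" and "cat" share no letters, yet they score higher than "dog" and "oak" because they meet at a closer common ancestor.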