Perl Text Comparison Followup

June 3, 2005 Kevin R Keegan

I recently posted a request for information about any Perl modules that would be useful for comparing two strings of text. I found String::Compare on CPAN, and I have been happy with its simplicity and speed.

Below is a summary of how the module works and what I think of it.

Summary

This module takes two strings, runs them through a series of tests, and returns a number between 0 and 1. The closer the result is to 1, the more alike the strings are. The module is very simple: it runs five tests. First, it checks how alike the strings are in a character-by-character analysis. Then it checks how many of the consonants are alike. Then it checks how many of the vowels are alike. Then it looks at word breaks and compares each word. Finally, it compares only the letters, skipping all symbols and numbers.
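To make the idea concrete, here is a rough sketch of a five-test comparison along the lines described above. It is written in Python for brevity and is not the module's actual code: the function names, the position-by-position matching, and the equal weighting of the five tests are all my own assumptions.

```python
VOWELS = set("aeiou")


def positional_score(a, b):
    """Fraction of positions at which two sequences hold equal items."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


def consonants(s):
    return [c for c in s if c.isalpha() and c not in VOWELS]


def vowels(s):
    return [c for c in s if c in VOWELS]


def letters(s):
    return [c for c in s if c.isalpha()]


def compare(s1, s2):
    """Average of five similarity tests, each between 0 and 1.

    Assumption: every test carries equal weight. The real module
    may score or combine its tests differently.
    """
    s1, s2 = s1.lower(), s2.lower()
    tests = [
        positional_score(s1, s2),                          # character by character
        positional_score(consonants(s1), consonants(s2)),  # consonants only
        positional_score(vowels(s1), vowels(s2)),          # vowels only
        positional_score(s1.split(), s2.split()),          # word by word
        positional_score(letters(s1), letters(s2)),        # letters only
    ]
    return sum(tests) / len(tests)
```

With this sketch, two identical strings score 1.0 and two strings with nothing in common score 0.0. Anything in between depends on how many of the five tests agree.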

Results

I found that this method works well, but not perfectly. The module is extremely fast and adds very little overhead to the script, which I like. Most pairs of strings that score 0.80 or above are nearly identical, and in my script they can be treated as matches.

The one major problem I had with the module is that it does not account for common variances, such as "and" and "&". If one string is "Rattle and Hum" and the other is "Rattle & Hum", the result is a lower score even though the strings are nearly identical. I am sure there are other examples.

In the future I think these issues should be addressed, but at the moment I am not sure how.
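One possible approach to variances like "and" versus "&" is to normalize both strings before comparing them. The sketch below is only an illustration in Python, not something the module does: the VARIANTS table and the normalize function are hypothetical names, and the list of equivalents is a made-up starting point.

```python
# Hypothetical table of equivalent spellings; extend it as new cases turn up.
VARIANTS = {
    "&": "and",
    "w/": "with",
}


def normalize(text):
    """Lower-case the text and rewrite known variants to one canonical form."""
    text = text.lower()
    for variant, canonical in VARIANTS.items():
        # Pad with spaces so "R&B" becomes "r and b" rather than "randb".
        text = text.replace(variant, f" {canonical} ")
    # Collapse the runs of whitespace that the padding may have introduced.
    return " ".join(text.split())
```

Both "Rattle and Hum" and "Rattle & Hum" normalize to "rattle and hum", so any comparison run on the normalized strings would see them as identical. The hard part is building a table of variances that is complete enough to be useful.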

If you have seen other Perl modules that do this task better, please comment and let me know.

You can also read a very in-depth response to a Google answer here.