Textual tolerance with the help of perl

In many occasions, you have text entered that doesn't match what is expected, because of typing or spelling mistakes. With perl, we can find many ways to establish some sort of a textual tolerance.

The first obvious choice is to use the Text::Soundex module, an implementation of the Donald E. Knuth algorithm that has the great privilege to be part of the standard perl distribution. The soundex(); function returns a phonetic value to the word entered, in the shape of a four characters string (one upper case letter and three digits). Two words phonetically equivalent will return the same value, two close words, close values. The drawback is that this module is only designed for english language, implying a different perspective on phonetic issues, different to most of the other languages like french in my exemple. I am not aware of any similar implementation of phonetics to other language in perl but i'd be happy to be proven wrong. Even though it is english centric, it can still be used for many european languages, with less precision though. It is very usefull to find similar instances, to get closer to a keyword in a thesaurus, index or dictionary.

The major opponent of the Soundex method can be found in the Text::Metaphone module. More modern and also based on english, this module grants you access to the Metaphone() fonction that takes a string and converts it to it's phonetic equivalent using a special set of characters. It's also a slow process but the output is richer than that of soundex so it can be used in certain cases. The downside is that the result is not strictly formated so it is harder to further manipulate it. The good news is that there's now a Text::TransMetaphone module that is meant to provide internationalization of the phonetic encoding process. At the time of writing, it is limited to arabic, greek, hebrew, russian or japanese, to name a few and most important ones. This means that there are very few european languages by now but the project is ambitious and we can potentially hope for further extensions.

The second approach, more focused on typing mistakes, is based on the comparison of pairs of letters between the words, in view of the length of the strings. We deal with pairs in order to get more relevant calculations in a simple way. That can be implemented this way, with two strings $a et $b, the first being conventionally the longest:

$nbfault = 0;
for $i (0..length($a)-2) {
  $suba = substr($a,$i,2);
  $nbfault++ unless (($suba eq substr($b,$i,2)) or ($suba eq substr($b,$i-1,2)));
}
print $nbfault;

So we get 1 in case of a double, and 2 in case of an added letter or substitution. This code is aimed at single letter mistakes but can be easily extended. For the frequent case of exchanged/twisted letters, it only takes another new condition ($suba eq reverse(substr($b,$i,2))) to be added in order to get a 2 result instead of a 3 if the mistake is inside the word. In the end, with a result inferior or equal to 2, you can conclude that a typing mistake has occured, whatever the language.

If we compare the two methods with a simple benchmark:

use Benchmark;
$timer = new Benchmark;
for $i (1..10000) {
  ...
}
$timez = new Benchmark;
$td = timediff($timer,$timez);
print timestr($td);

We can see that soundex is much more expensive in terms of processor time, on an overage scale of 5 to 1 (variable depending upon the platform). That is far from neglectable but it's aim is much more ambitious.

The String::Approx module gives us another solution. It allow fuzzy matching of strings of text. In fact, it measures the "Levenshtein edit distance", evaluating the number of errors encountered. It is aware of three type of errors insertions, deletions and substitutions. It's internal process is to evaluate the number of steps fron one word to another. If we are looking for only one level of error, we can use the amatch function that deals with mistakes in the 10% range by default.

With $oj being our original word and @base containing the list of the possible words, to get @possibl fulll of possible matches, we can code :

use String::Approx qw(amatch);
@possibl = amatch($oj,@base);

In scalar context, amatch returns the error count. However, albeit recent improvements in this module, it remains much slower than our approach.

It can be very useful in extreme cases to combine soundex and pathern matching because phonetically close words tend to return simiral codes, with differences being in only one of the four characters :

use Text:Soundex;
$vala = soundex($a);
$valb = soundex($b);
$total = 0;
for $i (0..3) {
  $total++ if (substr($vala,$i,1) eq substr($valb,$i,1));
}
print "OK" if ($total>2);

But in this case, you get to a point where you tend to try to decipher a similarity that can be way too thin, beyond phonetics. This type of operation cannot be easily done with the Text::Metaphone module.

PS: If you were only interested in textual tolerance in perl, just focus on explicit variable names and stick to a good use strict to stay out or trouble or use the Symbol::Approx::Sub at your own risk.

main menu