skip to Main Content

From here Levenshtein distance on diacritic characters
I’m using this PHP function to calculate the levenshtein distance for UTF8 characters.

function levenshtein_php($str1, $str2){
  $length1 = mb_strlen( $str1, 'UTF-8');
  $length2 = mb_strlen( $str2, 'UTF-8');
  if( $length1 < $length2) return levenshtein_php($str2, $str1);
  if( $length1 == 0 ) return $length2;
  if( $str1 === $str2) return 0;
  $prevRow = range( 0, $length2);
  $currentRow = array();
  for ( $i = 0; $i < $length1; $i++ ) {
    $currentRow=array();
    $currentRow[0] = $i + 1;
    $c1 = mb_substr( $str1, $i, 1, 'UTF-8') ;
    for ( $j = 0; $j < $length2; $j++ ) {
      $c2 = mb_substr( $str2, $j, 1, 'UTF-8' );
      $insertions = $prevRow[$j+1] + 1;
      $deletions = $currentRow[$j] + 1;
      $substitutions = $prevRow[$j] + (($c1 != $c2)?1:0);
      $currentRow[] = min($insertions, $deletions, $substitutions);
    }
    $prevRow = $currentRow;
  }
  return $prevRow[$length2];
}

When I’m using it with
$string1 = ‘încat de counștițe’;
$string2= ‘încântat’;

I get that the levensthein difference is 13 , which I consider to be wrong.

The only 2 options that I see are the following changes to $string1:

    1)
    a) add the characters 'ânt' to reach $string1 = 'încântat de counștițe';
    b) delete the characters ' de counștițe' to reach $string1 = 'încântat' ;

which would lead to a difference of 17 changes


    2) 
    a) replace 'a' with 'â' to reach $string1 = 'încât de counștițe';
    b) delete 't de cou' to reach $string1 = 'încânștițe';
    c) delete 'ș' to reach $string1 = 'încântițe';
    d) add characters 'at' to reach to reach $string1 = 'încântatițe';
    e) remove characters 'ițe' to reach $string1 = 'încântat';

with an levensthein distance of 15

Could you please help me correct the above levensthein_php code to return the correct difference. If there is any other existing PHP function I’d be happy to use it, however I unerstood that there is no function mb_levenshtein.

If it matters I’m running it on PHP 7.4.33.

2

Answers


  1. Your example text implies that the character set is limited to certain Western European alphabets. I suggest you use the suitable conversion routine to change from UTF-8 to Latin1 (or other suitable encoding). Then the characters would be single-byte, and no "mb" routines are needed.

    Login or Signup to reply.
  2. Your second example of calculating the distance contains an error in my opinion:

    a) replace 'a' with 'â' to reach $string1 = 'încât de counștițe';
    b) delete 't de cou' to reach $string1 = 'încânștițe';
    c) delete 'ș' to reach $string1 = 'încântițe';
    d) add characters 'at' to reach $string1 = 'încântatițe';
    e) remove characters 'ițe' to reach $string1 = 'încântat';
    

    The last two lines should read:

    d) replace characters 'iț' with 'at' to reach $string1 = 'încântate';
    e) remove character 'e' to reach $string1 = 'încântat';
    

    Therefore the result is 13 and not 15, and the function works correctly.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search