Php - Wrong levenshtein UTF8

MMT
October 7, 2023
248 views
0 votes
2 Answers

From here Levenshtein distance on diacritic characters
I’m using this PHP function to calculate the levenshtein distance for UTF8 characters.

function levenshtein_php($str1, $str2){
  $length1 = mb_strlen( $str1, 'UTF-8');
  $length2 = mb_strlen( $str2, 'UTF-8');
  if( $length1 < $length2) return levenshtein_php($str2, $str1);
  if( $length1 == 0 ) return $length2;
  if( $str1 === $str2) return 0;
  $prevRow = range( 0, $length2);
  $currentRow = array();
  for ( $i = 0; $i < $length1; $i++ ) {
    $currentRow=array();
    $currentRow[0] = $i + 1;
    $c1 = mb_substr( $str1, $i, 1, 'UTF-8') ;
    for ( $j = 0; $j < $length2; $j++ ) {
      $c2 = mb_substr( $str2, $j, 1, 'UTF-8' );
      $insertions = $prevRow[$j+1] + 1;
      $deletions = $currentRow[$j] + 1;
      $substitutions = $prevRow[$j] + (($c1 != $c2)?1:0);
      $currentRow[] = min($insertions, $deletions, $substitutions);
    }
    $prevRow = $currentRow;
  }
  return $prevRow[$length2];
}

When I’m using it with
$string1 = ‘încat de counștițe’;
$string2= ‘încântat’;

I get that the levensthein difference is 13 , which I consider to be wrong.

The only 2 options that I see are the following changes to $string1:

    1)
    a) add the characters 'ânt' to reach $string1 = 'încântat de counștițe';
    b) delete the characters ' de counștițe' to reach $string1 = 'încântat' ;

which would lead to a difference of 17 changes


    2) 
    a) replace 'a' with 'â' to reach $string1 = 'încât de counștițe';
    b) delete 't de cou' to reach $string1 = 'încânștițe';
    c) delete 'ș' to reach $string1 = 'încântițe';
    d) add characters 'at' to reach to reach $string1 = 'încântatițe';
    e) remove characters 'ițe' to reach $string1 = 'încântat';

with an levensthein distance of 15

Could you please help me correct the above levensthein_php code to return the correct difference. If there is any other existing PHP function I’d be happy to use it, however I unerstood that there is no function mb_levenshtein.

If it matters I’m running it on PHP 7.4.33.

Tags: levenshtein-distance php

Answers

- RickJames
- October 7, 2023 at 7:59 pm
- 0 votes
0
Your example text implies that the character set is limited to certain Western European alphabets. I suggest you use the suitable conversion routine to change from UTF-8 to Latin1 (or other suitable encoding). Then the characters would be single-byte, and no "mb" routines are needed.

Login or Signup to reply.

- lukasj
- October 7, 2023 at 8:13 pm
- 0 votes
0
Your second example of calculating the distance contains an error in my opinion:
```
a) replace 'a' with 'â' to reach $string1 = 'încât de counștițe';
b) delete 't de cou' to reach $string1 = 'încânștițe';
c) delete 'ș' to reach $string1 = 'încântițe';
d) add characters 'at' to reach $string1 = 'încântatițe';
e) remove characters 'ițe' to reach $string1 = 'încântat';
```
The last two lines should read:
```
d) replace characters 'iț' with 'at' to reach $string1 = 'încântate';
e) remove character 'e' to reach $string1 = 'încântat';
```
Therefore the result is 13 and not 15, and the function works correctly.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Php – Wrong levenshtein UTF8

Answers