I’ve read many posts that explain how to deal with Unicode characters, but none of the suggestions are working for me.

My PHP page reads a file that contains strings with accented (non-ASCII) characters, e.g., "Mötor". I want to convert the strings to plain ASCII characters, e.g., "Motor".

This is what I have tried:

$source = "Mötor";
$test = preg_replace('/[^\w\d\p{L}]/u', "", $source); // Returns null.
$test = preg_replace('/[^\w\d\p{L}]/u', "", htmlentities($source)); // Returns "".
$test = preg_replace("/&([a-z])[a-z]+;/i", "$1", $source); // Returns "Mötor".
$test = preg_replace("/&([a-z])[a-z]+;/i", "$1", htmlentities($source)); // Returns "".
$test = iconv('utf-8', 'ascii//TRANSLIT', $source); // Returns false.

I am stumped. Thanks!

2 Answers


  1. This is called "transliteration", and the intl extension's Transliterator will work far better than bodging together regular expressions.

    $tests = [ "Mötor" ];
    
    $tl = Transliterator::create('Latin-ASCII;');
    foreach($tests as $str) {
        var_dump(
            $tl->transliterate($str)
        );
    }
    

    Output:

    string(5) "Motor"
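
    If the transliterator is used in more than one place, it may help to wrap it and check the failure cases. The sketch below is only an illustration: the helper name `to_ascii` and the exception messages are made up here, not part of the intl API. It relies on `Transliterator::create()` returning null for an unknown ID and `transliterate()` returning false on failure. The null and false results shown in the question would be consistent with input that is not valid UTF-8, but that is an assumption about the asker's file.

    ```php
    <?php
    // Hypothetical helper (not from the answer): wraps the Latin-ASCII transform
    // and turns the null/false failure values into exceptions.
    function to_ascii(string $text): string
    {
        // Transliterator::create() returns null when ICU does not know the ID.
        $tl = Transliterator::create('Latin-ASCII');
        if ($tl === null) {
            throw new RuntimeException('Latin-ASCII transliterator is unavailable');
        }

        // transliterate() returns false on failure.
        $result = $tl->transliterate($text);
        if ($result === false) {
            throw new InvalidArgumentException(
                'Transliteration failed; check that the input is valid UTF-8'
            );
        }
        return $result;
    }

    // "\u{00F6}" is ö, written as an escape so the file encoding does not matter.
    var_dump(to_ascii("M\u{00F6}tor"));
    ```

    If the file turns out to be in another encoding (for example ISO-8859-1), converting it to UTF-8 first, e.g. with mb_convert_encoding(), would be the step to try before transliterating.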
    
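  2. A well-proven way: decompose the string with Normalizer (form D), then remove the non-spacing combining marks (\p{Mn}) with a regex: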

    <?php
    $source = "Mötor, šeřík, Προϊστορία, Україна";
    var_dump( $source);
    var_dump( preg_replace("/\p{Mn}/u", '',
                Normalizer::normalize( $source, Normalizer::FORM_D )));
    ?>
    

    Output: .SO76446827.php

    string(54) "Mötor, šeřík, Προϊστορία, Україна"
    string(50) "Motor, serik, Προιστορια, Украіна"
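
    The same two steps can be packaged as a function. This is a minimal sketch, assuming the intl extension is loaded and the input is UTF-8; the name `strip_marks` and the error handling are additions for illustration, not part of the snippet above.

    ```php
    <?php
    // Hypothetical helper name; the technique is the normalize-then-strip
    // approach shown above.
    function strip_marks(string $text): string
    {
        // Step 1: canonical decomposition, e.g. "ö" becomes "o" + U+0308.
        $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
        if ($decomposed === false) {
            throw new InvalidArgumentException('Input could not be normalized');
        }

        // Step 2: remove non-spacing marks; base letters of any script stay.
        $stripped = preg_replace('/\p{Mn}/u', '', $decomposed);
        if ($stripped === null) {
            throw new RuntimeException('preg_replace failed');
        }
        return $stripped;
    }

    var_dump(strip_marks("M\u{00F6}tor")); // "\u{00F6}" is ö
    ```

    Letters that have no canonical decomposition, such as ø (U+00F8) or ß (U+00DF), pass through this function unchanged, because there is no combining mark to remove.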
    

    Resources (required reading):

    • Unicode Normalization Forms

    • Regular expressions: Unicode Categories

      • \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
        • \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
        • \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
        • \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
    • PHP manual: Unicode character properties (note /u option for Unicode support in regex)
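
    To make the difference between the categories concrete, here is a small sketch (the variable names are arbitrary). It uses the Devanagari syllable "का", which is the letter KA (U+0915) followed by the vowel sign AA (U+093E); that vowel sign has category Mc, not Mn. It illustrates why removing only \p{Mn} is the more conservative choice: removing all of \p{M} would also delete vowel signs that carry meaning.

    ```php
    <?php
    // KA (U+0915) + VOWEL SIGN AA (U+093E, category Mc: spacing combining mark).
    $text = "\u{0915}\u{093E}";

    // \p{Mn} matches only non-spacing marks, so the vowel sign survives.
    $withoutMn = preg_replace('/\p{Mn}/u', '', $text);

    // \p{M} matches every mark category (Mn, Mc, Me), so the vowel sign is removed.
    $withoutM = preg_replace('/\p{M}/u', '', $text);

    var_dump($withoutMn === $text);       // bool(true)
    var_dump($withoutM === "\u{0915}");   // bool(true)
    ```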

    Note: the test string contains accented characters from several scripts (both Western and Eastern Latin, Greek, and Cyrillic) to demonstrate that the regex is script-independent:

    • ö (U+00F6, Latin Small Letter O With Diaeresis)
    • š (U+0161, Latin Small Letter S With Caron)
    • ř (U+0159, Latin Small Letter R With Caron)
    • í (U+00ED, Latin Small Letter I With Acute)
    • ϊ (U+03CA, Greek Small Letter Iota With Dialytika)
    • ί (U+03AF, Greek Small Letter Iota With Tonos)
    • ї (U+0457, Cyrillic Small Letter Yi)
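
    To see why the regex works on each of these characters, it can help to print their code points after decomposition. A rough sketch, assuming PHP 7.4+ with the intl and mbstring extensions; `nfd_code_points` is a made-up helper name:

    ```php
    <?php
    // Hypothetical helper: lists the code points of a string after NFD.
    function nfd_code_points(string $text): array
    {
        $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
        if ($decomposed === false) {
            throw new InvalidArgumentException('Input could not be normalized');
        }
        return array_map(
            fn(string $ch): string => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
            mb_str_split($decomposed, 1, 'UTF-8')
        );
    }

    // ö, š, ϊ and ї, written as escapes so the file encoding does not matter.
    foreach (["\u{00F6}", "\u{0161}", "\u{03CA}", "\u{0457}"] as $char) {
        echo $char, ' => ', implode(' ', nfd_code_points($char)), PHP_EOL;
    }
    // ö => U+006F U+0308
    // š => U+0073 U+030C
    // ϊ => U+03B9 U+0308
    // ї => U+0456 U+0308
    ```

    In each case the second code point is a combining mark of category Mn, which is what \p{Mn} removes; the first one is the base letter that remains.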