PHP - Yet another Unicode preg_replace() question

SteveA
June 13, 2023
185 views
0 votes
2 Answers

I’ve read many posts that explain how to deal with Unicode characters, but none of the suggestions are working for me.

My PHP page reads a file that contains strings with non-ASCII (high-order) characters, e.g., "Mötor". I want to convert the strings to plain ASCII characters, e.g., "Motor".

This is what I have tried:

$source = "Mötor";
$test = preg_replace('/[^\w\d\p{L}]/u', "", $source); // Returns null.
$test = preg_replace('/[^\w\d\p{L}]/u', "", htmlentities($source)); // Returns "".
$test = preg_replace("/&([a-z])[a-z]+;/i", "$1", $source); // Returns "Mötor".
$test = preg_replace("/&([a-z])[a-z]+;/i", "$1", htmlentities($source)); // Returns "".
$test = iconv('utf-8', 'ascii//TRANSLIT', $source); // Returns false.

I am stumped. Thanks!

Tags: php unicode-string

Answers

- Sammitch
- June 10, 2023 at 9:40 pm
- 0 votes
This is called "transliteration" and intl’s Transliterator will work far better than bodging together regular expressions.
```
$tests = [ "Mötor" ];

$tl = Transliterator::create('Latin-ASCII;');
foreach($tests as $str) {
    var_dump(
        $tl->transliterate($str)
    );
}
```
Output:
```
string(5) "Motor"
```

- JosefZ
- June 11, 2023 at 9:22 pm
- 0 votes
A well-proven way:
```
<?php
$source = "Mötor, šeřík, Προϊστορία, Україна";
var_dump( $source);
var_dump( preg_replace("/\p{Mn}/u", '',
            Normalizer::normalize( $source, Normalizer::FORM_D )));
?>
```
Output: `.\SO76446827.php`
```
string(54) "Mötor, šeřík, Προϊστορία, Україна"
string(50) "Motor, serik, Προιστορια, Украіна"
```
Resources (required reading):
- Unicode Normalization Forms
- Regular expressions: Unicode Categories
  - \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
    - \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
    - \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
    - \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
- PHP manual: Unicode character properties (note /u option for Unicode support in regex)
Note: the test string contains accented characters from various scripts (both Western and Eastern Latin, Greek, and Cyrillic) to demonstrate that the regex is script-independent:
- ö (U+00F6, Latin Small Letter O With Diaeresis)
- š (U+0161, Latin Small Letter S With Caron)
- ř (U+0159, Latin Small Letter R With Caron)
- í (U+00ED, Latin Small Letter I With Acute)
- ϊ (U+03CA, Greek Small Letter Iota With Dialytika)
- ί (U+03AF, Greek Small Letter Iota With Tonos)
- ї (U+0457, Cyrillic Small Letter Yi)
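To see why stripping non-spacing marks works for the characters listed above, it can help to print the code points before and after decomposition. A small sketch; `codePoints()` is a hypothetical helper, and the intl and mbstring extensions plus PHP 7.4 or later are assumed:

```php
<?php
// Hypothetical helper: list the code points of a UTF-8 string as "U+XXXX".
function codePoints(string $s): array
{
    return array_map(
        fn(string $ch): string => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
        mb_str_split($s, 1, 'UTF-8')
    );
}

foreach (['ö', 'ї'] as $ch) {
    // NFD splits a precomposed letter into base letter + combining mark.
    $nfd = Normalizer::normalize($ch, Normalizer::FORM_D);
    printf("%s => %s\n", implode(' ', codePoints($ch)), implode(' ', codePoints($nfd)));
}
// Prints:
// U+00F6 => U+006F U+0308
// U+0457 => U+0456 U+0308
```

U+0308 (Combining Diaeresis) is in category Mn, so the regex removes it and leaves only the base letter. That is also why ї ends up as і rather than as a Latin letter.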