I’ve read many posts that explain how to deal with Unicode characters, but none of the suggestions are working for me.
My PHP page reads a file that contains strings with high-order characters, e.g., "Mötor". I want to convert the strings to "normal" characters, e.g., "Motor".
This is what I have tried:
$source = "Mötor";
$test = preg_replace('/[^\w\d\p{L}]/u', "", $source); // Returns null.
$test = preg_replace('/[^\w\d\p{L}]/u', "", htmlentities($source)); // Returns "".
$test = preg_replace("/&([a-z])[a-z]+;/i", "$1", $source); // Returns "Mötor".
$test = preg_replace("/&([a-z])[a-z]+;/i", "$1", htmlentities($source)); // Returns "".
$test = iconv('utf-8', 'ascii//TRANSLIT', $source); // Returns false.
I am stumped. Thanks!
2 Answers
This is called "transliteration" and intl’s Transliterator will work far better than bodging together regular expressions.
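A minimal sketch of what this could look like, assuming the intl extension is enabled and the input is valid UTF-8. The transform ID 'Any-Latin; Latin-ASCII' is an assumption made for this example and is not necessarily the one the answer used.

```php
<?php
// Assumes the intl extension is enabled and that $source is valid UTF-8.
$source = "Mötor";

// 'Any-Latin; Latin-ASCII' is an ICU transform ID chosen for this sketch:
// first convert any script to Latin, then fold Latin letters to plain ASCII.
$transliterator = Transliterator::create('Any-Latin; Latin-ASCII');
if ($transliterator === null) {
    throw new RuntimeException('Could not create transliterator: ' . intl_get_error_message());
}

$test = $transliterator->transliterate($source);
var_dump($test);
```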
Output:
A well-proven way:
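Judging by the resources listed further down, the approach is to decompose the string (Unicode normalization form D) and then remove every combining mark with \p{M}. A minimal sketch under that assumption, requiring the intl extension for Normalizer and valid UTF-8 input; stripMarks() is a hypothetical helper name, not something from the answer.

```php
<?php
// Assumes the intl extension (for Normalizer) and valid UTF-8 input.
// stripMarks() is a hypothetical helper name used only in this sketch.
function stripMarks(string $text): string
{
    // NFD splits precomposed letters into base letter + combining mark(s),
    // e.g. "ö" (U+00F6) becomes "o" (U+006F) followed by U+0308.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        throw new InvalidArgumentException('Input could not be normalized (invalid UTF-8?).');
    }

    // \p{M} matches every combining mark; the /u modifier enables Unicode mode.
    $stripped = preg_replace('/\p{M}/u', '', $decomposed);
    if ($stripped === null) {
        throw new RuntimeException('preg_replace failed with error code ' . preg_last_error());
    }

    return $stripped;
}

echo stripMarks("Mötor"), PHP_EOL;
```

Unlike transliteration, this only removes marks. Letters that have no decomposition, such as "ø" or "ß", pass through unchanged.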
Output:
.SO76446827.php
Resources (required reading):
Unicode Normalization Forms
Regular expressions: Unicode Categories
  \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
  \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
  \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
  \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
PHP manual: Unicode character properties (note the /u option for Unicode support in regex)
Note: the test string contains accented characters from various scripts (both Western and Eastern Latin, Greek, and Cyrillic) to demonstrate that the regex used is script-independent:
ö (U+00F6, Latin Small Letter O With Diaeresis)
š (U+0161, Latin Small Letter S With Caron)
ř (U+0159, Latin Small Letter R With Caron)
í (U+00ED, Latin Small Letter I With Acute)
ϊ (U+03CA, Greek Small Letter Iota With Dialytika)
ί (U+03AF, Greek Small Letter Iota With Tonos)
ї (U+0457, Cyrillic Small Letter Yi)
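To see why a single \p{M} pattern covers all of these scripts, it can help to print what each character decomposes into under normalization form D. A small sketch, assuming the intl and mbstring extensions and PHP 7.4 or later; describeDecomposition() is a hypothetical helper written only for this illustration.

```php
<?php
// Assumes intl (Normalizer) and mbstring (mb_str_split, mb_ord), PHP 7.4+.
// describeDecomposition() is a hypothetical helper for illustration only.
function describeDecomposition(string $char): string
{
    $nfd = Normalizer::normalize($char, Normalizer::FORM_D);
    if ($nfd === false) {
        throw new InvalidArgumentException('Input could not be normalized (invalid UTF-8?).');
    }

    // Format every code point of the decomposed form as U+XXXX.
    $codePoints = array_map(
        fn (string $c): string => sprintf('U+%04X', mb_ord($c, 'UTF-8')),
        mb_str_split($nfd, 1, 'UTF-8')
    );

    return implode(' ', $codePoints);
}

foreach (['ö', 'š', 'ř', 'í', 'ϊ', 'ί', 'ї'] as $char) {
    echo $char, ' => ', describeDecomposition($char), PHP_EOL;
}
```

Each character splits into a base letter of its own script followed by one combining mark (for example, ö gives U+006F U+0308 and ї gives U+0456 U+0308), so removing the marks leaves the base letter whatever the script.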