I'm using this PHP function for SEO URLs. It works fine with Latin words, but my URLs are in Cyrillic. This regex, /[^a-z0-9_\s-]/, does not work with Cyrillic characters. Please help me make it work with non-Latin characters.
function seoUrl($string) {
    // Lower case everything
    $string = strtolower($string);
    // Make alphanumeric (removes all other characters)
    $string = preg_replace('/[^a-z0-9_\s-]/', '', $string);
    // Clean up multiple dashes or whitespaces
    $string = preg_replace('/[\s-]+/', ' ', $string);
    // Convert whitespaces and underscore to dash
    $string = preg_replace('/[\s_]/', '-', $string);
    return $string;
}
2 Answers
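You need the Unicode script property for the Cyrillic alphabet, which PHP's PCRE fortunately supports as \p{Cyrillic}. You also have to set the u (Unicode) flag so that the engine treats the pattern and the subject as UTF-8. You may also need the i flag for case-insensitivity, so that a-z also covers A-Z. Applied to your character class, that gives a pattern along the lines of /[^\p{Cyrillic}a-z0-9_\s-]/ui. You don't need to double-escape \s.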
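A minimal sketch of how that pattern might be used with preg_replace. The sample title is made up for illustration, and the input is assumed to be valid UTF-8, which the u flag requires:

```php
<?php
// Hypothetical sample input; assumed to be valid UTF-8.
$title = 'Привет, Мир! Hello World';

// Remove everything except Cyrillic letters, a-z (and A-Z via the i flag),
// digits, underscores, whitespace, and dashes.
$clean = preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]~ui', '', $title);

echo $clean, PHP_EOL; // Привет Мир Hello World
```

The ~ delimiter is only a stylistic choice here; / works just as well, since the pattern contains no slashes.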
To learn more about Unicode Regular Expressions see this article.
\p{L} or \p{Letter} matches any kind of letter from any language. To match only Cyrillic characters, use \p{Cyrillic}.
Since Cyrillic characters are not standard ASCII characters, you have to use the u flag/modifier so that the regex recognizes Unicode characters as needed.
Be sure to use mb_strtolower instead of strtolower, since you are working with Unicode characters.
Because you convert all characters to lowercase, you don't need the i regex flag/modifier.
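Putting those points together, a function along these lines should work. Treat it as a sketch: it assumes the mbstring extension is enabled and that the input is valid UTF-8, and the sample input at the end is made up.

```php
<?php
function seoUrl($string) {
    // Lower case everything; mb_strtolower handles multibyte characters
    $string = mb_strtolower($string, 'UTF-8');
    // Keep Cyrillic letters, a-z, digits, underscores, whitespace, and dashes
    $string = preg_replace('/[^\p{Cyrillic}a-z0-9_\s-]+/u', '', $string);
    // Clean up multiple dashes or whitespaces
    $string = preg_replace('/[\s-]+/u', ' ', $string);
    // Convert whitespaces and underscore to dash
    $string = preg_replace('/[\s_]/u', '-', $string);
    return $string;
}

echo seoUrl('Привет, Мир!'), PHP_EOL; // привет-мир
```

The i flag is left out because the string is already lowercased before the character class is applied.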
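Furthermore, please note that \p{InCyrillic_Supplementary} matches all Cyrillic Supplementary characters and \p{InCyrillic} matches all non-Supplementary Cyrillic characters in regex flavors that support Unicode block names. PCRE, which PHP's preg functions use, documents only general-category and script properties such as \p{L} and \p{Cyrillic}, so check that these block names compile before relying on them in PHP.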