
I’m using this PHP function for SEO URLs. It works fine with Latin words, but my URLs are in Cyrillic. The regex /[^a-z0-9_\s-]/ does not work with Cyrillic characters. Please help me make it work with non-Latin characters.

function seoUrl($string) {
    // Lower case everything
    $string = strtolower($string);
    // Make alphanumeric (removes all other characters)
    $string = preg_replace('/[^a-z0-9_\s-]/', '', $string);
    // Clean up multiple dashes or whitespaces
    $string = preg_replace('/[\s-]+/', ' ', $string);
    // Convert whitespaces and underscore to dash
    $string = preg_replace('/[\s_]/', '-', $string);
    return $string;
}

2 Answers


  1. You need a Unicode script property for the Cyrillic alphabet, which PHP’s PCRE fortunately supports as \p{Cyrillic}. You also have to set the u (Unicode) flag so that the engine treats the pattern and the subject as UTF-8. You may also want the i flag for case-insensitive matching, so that a-z also covers A-Z:

    ~[^\p{Cyrillic}a-z0-9_\s-]~ui
    

    You don’t need to double-escape \s.

    PHP code:

    preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $string);
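
As a quick check of the pattern from this answer, here is a minimal sketch. The input string and the variable names are made up for illustration, and it assumes PHP's PCRE is built with Unicode support, which is the usual case.

```php
<?php
// Illustrative input only; it does not come from the question.
$title = 'Привет, World! 2024';

// Remove everything except Cyrillic letters, ASCII letters, digits,
// underscore, whitespace and dash. The u flag enables UTF-8 handling,
// and the i flag lets a-z also match A-Z.
$clean = preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $title);

echo $clean, PHP_EOL; // Привет World 2024
```

The comma and the exclamation mark are stripped, while the Cyrillic word, the mixed-case Latin word and the digits survive.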
    
  2. To learn more about Unicode Regular Expressions see this article.

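    \p{L} matches any kind of letter from any language. Some regex flavors also accept the long form \p{Letter}, but the PCRE documentation says such long synonyms are not supported, so use \p{L} in PHP.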

    To match only Cyrillic characters, use \p{Cyrillic}.

    Since Cyrillic characters are not ASCII characters, you have to use the u flag/modifier, so that the pattern and the subject are treated as UTF-8 and the regex recognizes Unicode characters as needed.
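
To make the difference between the two properties concrete, here is a small sketch. The sample string and variable names are hypothetical, and it assumes the string is UTF-8 and that PCRE has Unicode support.

```php
<?php
// Hypothetical sample mixing Cyrillic, Greek, Latin and digits.
$sample = 'мир αβγ word 42';

// \p{L}: keep letters of any script, drop everything else.
$anyLetters = preg_replace('/[^\p{L}]+/u', '', $sample);

// \p{Cyrillic}: keep only characters of the Cyrillic script.
$cyrillicOnly = preg_replace('/[^\p{Cyrillic}]+/u', '', $sample);

echo $anyLetters, PHP_EOL;   // мирαβγword
echo $cyrillicOnly, PHP_EOL; // мир
```

For a URL slug limited to Cyrillic and Latin, \p{Cyrillic} plus a-z is the narrower choice; \p{L} would also let Greek, Arabic and other scripts through.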

    Be sure to use mb_strtolower instead of strtolower, since you are working with Unicode characters.

    Because you convert all characters to lowercase first, you don’t need the i regex flag/modifier.
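
A minimal sketch of the lowercasing step, assuming the mbstring extension is loaded and the input is UTF-8 (the sample text is made up):

```php
<?php
// Illustrative input with upper-case Cyrillic letters.
$title = 'ПРИВЕТ Мир';

// mb_strtolower() is multibyte-aware; passing the encoding explicitly
// avoids depending on the configured default.
$lower = mb_strtolower($title, 'UTF-8');

echo $lower, PHP_EOL; // привет мир
```

strtolower() works on single bytes, so it cannot be relied on to lowercase multibyte UTF-8 letters such as these.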


    The following PHP code should work for you:

    function seoUrl($string) {
        // Lower case everything
        $string = mb_strtolower($string, 'UTF-8');
        // Make alphanumeric (removes all other characters)
        $string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
        // Clean up multiple dashes or whitespaces (u keeps the subject handled as UTF-8)
        $string = preg_replace('/[\s-]+/u', ' ', $string);
        // Convert whitespaces and underscore to dash
        $string = preg_replace('/[\s_]/u', '-', $string);
        return $string;
    }
    

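    Furthermore, note that in regex flavors with Unicode block properties (Java, for example), \p{InCyrillic_Supplementary} matches the Cyrillic Supplementary block and \p{InCyrillic} matches the basic Cyrillic block. The PCRE documentation lists such In... block properties as unsupported, so in PHP stick with the script property \p{Cyrillic}.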
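
To see the whole function in action, here is a self-contained sketch with a sample call. The input strings are invented, and the final trim() of stray dashes is an assumption added for illustration; it is not part of the answer's function.

```php
<?php
function seoUrl($string) {
    // Lower case everything (multibyte-aware)
    $string = mb_strtolower($string, 'UTF-8');
    // Keep Cyrillic letters, a-z, digits, whitespace, underscore and dash
    $string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
    // Collapse runs of whitespace and dashes into one space
    $string = preg_replace('/[\s-]+/u', ' ', $string);
    // Convert whitespace and underscores to dashes
    $string = preg_replace('/[\s_]/u', '-', $string);
    // Assumption, not in the answer: drop dashes left at either end
    return trim($string, '-');
}

echo seoUrl('Привет, мир! Hello_World -- 2024'), PHP_EOL;
// привет-мир-hello-world-2024
```

Punctuation is removed first, so the " -- " run collapses to a single separator instead of producing several dashes in a row.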
