
What is a PHP regex to check whether an uploaded file name contains German umlauts?
File name: Screenshot_Erdös.png

I tried the code below, but it does not work:

if ( preg_match('(?<![äöüÄÖÜßw])([äöüÄÖÜßw]+)(?![äöüÄÖÜßw])', $file_name )){
        $file['error'] = __( "WARNING: Invalid file name. German umlauts are not allowed.", 'wp-file' );
    }

2 Answers


  1. You can try this

    if (preg_match('/[äöüÄÖÜß]/', $file_name)) {
        $file['error'] = __("WARNING: Invalid file name. German umlauts are not allowed.", 'wp-file');
    }
    
  2. Unicode offers two ways to produce an "a with umlaut" (or diaeresis):

    • the precomposed ("readymade") character: code point U+00E4 ä LATIN SMALL LETTER A WITH DIAERESIS
    • a sequence of two code points: U+0061 a (LATIN SMALL LETTER A), followed by U+0308 ̈ (COMBINING DIAERESIS)

    The other vowels (e, i, o, u) and y are in the same situation: each can be written in either of these two ways.

    To deal with this, you can either cover both possibilities in your pattern or use the Normalizer from the intl extension to convert the string to NFC first.
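
    The two forms can be made visible with a short sketch. The variable names are illustrative only, and the normalization step assumes the intl extension is loaded, so it is guarded by a function_exists check:

    ```php
    <?php
    // Precomposed form: the single code point U+00E4.
    $precomposed = "\u{00E4}";
    // Decomposed form: U+0061 followed by U+0308 COMBINING DIAERESIS.
    $decomposed = "a\u{0308}";

    // Both render as "ä", yet their UTF-8 bytes differ.
    var_dump(bin2hex($precomposed));        // string(4) "c3a4"
    var_dump(bin2hex($decomposed));         // string(6) "61cc88"
    var_dump($precomposed === $decomposed); // bool(false)

    // Assumption: intl is installed. NFC folds the decomposed form
    // into the precomposed one.
    if (function_exists('normalizer_normalize')) {
        $nfc = normalizer_normalize($decomposed, Normalizer::FORM_C);
        var_dump($nfc === $precomposed);    // bool(true)
    }
    ```

    A pattern that lists only the precomposed character would therefore miss the decomposed string unless it is normalized first.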


    Another point to take into account: when you deal with multibyte characters (which is the case for accented characters in UTF-8), you need to tell the regex engine so. Otherwise it reads the subject string and the pattern byte by byte instead of code point by code point.

    Consider this character class: [ä] (with the precomposed small a with diaeresis). ä is encoded as two bytes in UTF-8: C3 A4.
    By default, a pattern with this character class therefore succeeds if either of these two bytes is found in the subject string. That does not mean the subject string contains ä:

    var_dump(preg_match('~[ä]~', '↤')); // int(1)
    

    This pattern succeeds because U+21A4 LEFTWARDS ARROW FROM BAR is encoded with the bytes E2 86 A4 and the byte A4 is found.
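
    The byte overlap described above can be inspected directly with bin2hex. This is only an illustration; the variable names are made up for the example:

    ```php
    <?php
    $a_umlaut = "\u{00E4}"; // ä
    $arrow    = "\u{21A4}"; // ↤

    var_dump(bin2hex($a_umlaut)); // string(4) "c3a4"
    var_dump(bin2hex($arrow));    // string(6) "e286a4"

    // The arrow contains the single byte A4 ...
    var_dump(strpos($arrow, "\xA4") !== false);    // bool(true)
    // ... but not the two-byte sequence C3 A4 that actually encodes ä.
    var_dump(strpos($arrow, $a_umlaut) !== false); // bool(false)
    ```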

    To tell the regex engine that the strings (the pattern and the subject) have to be read code point by code point, you can start the pattern like this:

    var_dump(preg_match('~(*UTF8)[ä]~', '↤')); // int(0)
    

    or use the u modifier:

    var_dump(preg_match('~[ä]~u', '↤')); // int(0)
    

    To conclude, a pattern to match a diaeresis can be written like this:

    preg_match('~[äëïöüÿ\N{U+00A8}\N{U+0308}]~ui', $subject)
    

    or

    preg_match('~[äëïöüÿ\N{U+00A8}]|[aeiouy]\N{U+0308}~ui', $subject)
    

    where \N{U+0308} stands for the combining diaeresis and \N{U+00A8} for the standalone diaeresis. äëïöüÿ are precomposed characters from the Unicode block Latin-1 Supplement (U+0080 to U+00FF). Uppercase letters are covered by the i modifier.

    or like this:

    preg_match('~(*UTF8)[äëïöüÿ\N{U+00A8}\N{U+0308}]~i', $subject)
    

    or with an NFC-normalized string:

    preg_match('~[äëïöüÿ\N{U+00A8}]~ui', normalizer_normalize($subject))
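
    Applied to the file name from the question, the idea could be wrapped as in the sketch below. The function name has_german_umlaut is hypothetical, the character set is narrowed to the German letters ä ö ü Ä Ö Ü ß, and the code points are written with \x{...} escapes instead of literal characters or \N{U+...} so that the pattern does not depend on the encoding of the source file:

    ```php
    <?php
    /**
     * Hypothetical helper: reports whether a UTF-8 file name contains a
     * German umlaut or eszett, in precomposed or decomposed form.
     */
    function has_german_umlaut(string $file_name): bool
    {
        // Precomposed ä ö ü Ä Ö Ü ß, or a base vowel followed by U+0308.
        $pattern = '~[\x{E4}\x{F6}\x{FC}\x{C4}\x{D6}\x{DC}\x{DF}]|[aouAOU]\x{0308}~u';

        // preg_match returns 1 on a match, 0 on no match, and false on
        // error (for example when the subject is not valid UTF-8).
        return preg_match($pattern, $file_name) === 1;
    }

    var_dump(has_german_umlaut("Screenshot_Erd\u{00F6}s.png"));  // bool(true)
    var_dump(has_german_umlaut("Screenshot_Erdo\u{0308}s.png")); // bool(true)
    var_dump(has_german_umlaut("Screenshot_\u{21A4}.png"));      // bool(false)
    var_dump(has_german_umlaut("Screenshot.png"));               // bool(false)
    ```

    In the upload check from the question, the condition would then read if (has_german_umlaut($file_name)) before setting $file['error'].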
    