
I’m having some trouble matching/replacing the ZWSP Unicode character encoded as UTF-8.

ZWSP: \x20\x0B
ZWSP (UTF-8): \xE2\x80\x8B

As an extra test case I have used NBSP (Non-breaking space) which works as expected

All preg_replace calls use UTF-8 mode (/u)

  • When matching NBSP it works as expected. The input is encoded as UTF8 and the output is empty (NBSP unicode replaced with an empty string)

  • When matching ZWSP it only works if the ZWSP input is not UTF8 encoded.

  • If you change the ZWSP pattern to the UTF8 encoded version and keep input as UTF8 it doesn’t work either

Q: Then how do I match ZWSP in UTF-8?

… or is this a bug?

Code

$nbsp       = '\xA0'; // Non-breaking space
$zwsp       = '\x20\x0B'; // Zero-width space
$zwsp_utf8  = '\xE2\x80\x8B';

$input_nbsp_utf8    = "\xC2\xA0";
$input_zwsp         = "\x20\x0B";
$input_zwsp_utf8    = "\xE2\x80\x8B";

// NBSP
echo "NBSP\n-----\n";
echo "in: $input_nbsp_utf8--\nhex: ".bin2hex($input_nbsp_utf8)."\n";
$output = preg_replace('/'.$nbsp.'/u', '', $input_nbsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";

// ZWSP (input: **not** UTF8)
echo "ZWSP (input: **not** UTF8)\n-----\n";
echo "in: $input_zwsp--\nhex: ".bin2hex($input_zwsp)."\n";
$output = preg_replace('/'.$zwsp.'/u', '', $input_zwsp);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";

// ZWSP (input: UTF8)
echo "ZWSP (input: UTF8)\n-----\n";
echo "in: $input_zwsp_utf8--\nhex: ".bin2hex($input_zwsp_utf8)."\n";
$output = preg_replace('/'.$zwsp.'/u', '', $input_zwsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";

// ZWSP (pattern: UTF8, input: UTF8)
echo "ZWSP (pattern: UTF8, input: UTF8)\n-----\n";
echo "in: $input_zwsp_utf8--\nhex: ".bin2hex($input_zwsp_utf8)."\n";
$output = preg_replace('/'.$zwsp_utf8.'/u', '', $input_zwsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";

Output

NBSP
-----
in:  --
hex: c2a0
out: --
hex:

ZWSP (input: **not** UTF8)
-----
in:
     --
hex: 200b
out: --
hex:

ZWSP (input: UTF8)
-----
in: ​--
hex: e2808b
out: ​--
hex: e2808b // Output should be empty

ZWSP (pattern: UTF8, input: UTF8)
-----
in: ​--
hex: e2808b
out: ​--
hex: e2808b // Output should be empty

2 Answers


  1. Like many people, you seem to be confused about what UTF-8 is. UTF-8 isn’t a setting which is on or off, it is one of many different ways of turning text into binary data, and interpreting that binary data to get back the text.

    I’m not sure where \x20\x0B came from, or what it has to do with anything, but saying something is "not UTF-8" is like saying a word is "not French", or a piece of meat is "not chicken".
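
    To make the point about encodings concrete, here is a small sketch that encodes the single code point U+200B in two different ways. It assumes PHP 7.0 or later (for the \u{...} string escape) and the mbstring extension; the variable names are only illustrative. The UTF-16BE bytes come out as 20 0B, which is one plausible origin of the \x20\x0B in the question, though that is a guess.

    ```php
    <?php
    // U+200B (zero-width space) written as a PHP code point escape.
    // PHP stores the resulting string as UTF-8 bytes.
    $zwsp = "\u{200B}";

    // The same character, serialised with two different encodings.
    $utf8Hex  = bin2hex($zwsp);
    $utf16Hex = bin2hex(mb_convert_encoding($zwsp, 'UTF-16BE', 'UTF-8'));

    echo "UTF-8:    $utf8Hex\n";   // e2808b
    echo "UTF-16BE: $utf16Hex\n";  // 200b
    ```

    Both byte strings describe the same character; which one is "right" depends only on the encoding the rest of the program expects.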

    Ignoring that part, let’s look at the key piece of code:

    $input_zwsp_utf8 = "\xE2\x80\x8B";
    $output = preg_replace('/\xE2\x80\x8B/u', '', $input_zwsp_utf8);
    

    You have provided the /u modifier, about which the manual says:

    Pattern and subject strings are treated as UTF-8.

    Then you’ve matched using the \xhh notation, which is described under escape sequences:

    After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, "\x{…}" is allowed, where the contents of the braces is a string of hexadecimal digits. It is interpreted as a UTF-8 character whose code number is the given hexadecimal number. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.

    This is a bit confusing, but it’s saying that normally, \xE2 would match the binary byte E2, i.e. 11100010; but with /u active, it will instead match the Unicode code point U+00E2, which is "Latin Small Letter a With Circumflex".

    Example:

    $input = 'â';
    
    echo "in: $input\nhex: ".bin2hex($input)."\n";
    $output = preg_replace('/\xE2/u', '', $input);
    echo "out: $output\nhex: ".bin2hex($output)."\n\n";
    

    Output:

    in: â
    hex: c3a2
    out: 
    hex:
    

    What it won’t match is Unicode Code Point U+200B, "Zero-Width Space".

    So, either treat your string as binary, don’t use the /u modifier, and look for the expected string of bytes:

    $input_zwsp_utf8 = "\xE2\x80\x8B";
    $output = preg_replace('/\xE2\x80\x8B/', '', $input_zwsp_utf8);
    

    Or, treat your string as UTF-8, and look for the code point you’re interested in:

    $input_zwsp_utf8 = "\xE2\x80\x8B";
    $output = preg_replace('/\x{200B}/u', '', $input_zwsp_utf8);
    


  2. You can match a UTF-8 string using a regex without the /u modifier; in that case you are matching the separate bytes. If you are using a regex in UTF-8 mode, the input HAS to be a valid UTF-8 string.
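
    The requirement that the subject be valid UTF-8 can be checked directly. A minimal sketch, assuming a deliberately truncated byte sequence as the invalid input (the variable names are illustrative): with /u, preg_replace() returns null and preg_last_error() reports PREG_BAD_UTF8_ERROR, while the byte-mode pattern still runs and simply finds nothing.

    ```php
    <?php
    // "\xE2\x80" starts a three-byte UTF-8 sequence but the last byte is missing.
    $invalid = "abc\xE2\x80";

    // UTF-8 mode: the subject is validated first, so the call fails.
    $utf8Result = preg_replace('/\x{200B}/u', '', $invalid);
    $utf8Error  = preg_last_error();

    // Byte mode: no validation; the pattern just does not match.
    $byteResult = preg_replace('/\xE2\x80\x8B/', '', $invalid);

    var_dump($utf8Result === null);               // bool(true)
    var_dump($utf8Error === PREG_BAD_UTF8_ERROR); // bool(true)
    var_dump($byteResult === $invalid);           // bool(true)
    ```

    Checking for a null return is worthwhile whenever /u is applied to input whose encoding is not guaranteed.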

    Matching Codepoints

    // input with UTF-8 byte sequences
    $input    = "NBSP: |\xC2\xA0|, ZWSP: |\xE2\x80\x8B|";
    $patterns = [
        // codepoint defined in PHP string literal with \u{}
        "([\u{00A0}\u{200B}])u",
        // codepoint defined in PCRE with \x{}
        '([\\x{00A0}\\x{200B}])u',
    ];
    
    foreach ($patterns as $pattern) {
        var_dump(
            [
                'pattern' => $pattern,
                'input' => $input,
                'output' => preg_replace($pattern, '-', $input),
            ]
        );
    }
    

    Output:

    array(3) {
      ["pattern"]=>
      string(10) "([ ​])u"
      ["input"]=>
      string(23) "NBSP: | |, ZWSP: |​|"
      ["output"]=>
      string(20) "NBSP: |-|, ZWSP: |-|"
    }
    array(3) {
      ["pattern"]=>
      string(21) "([\x{00A0}\x{200B}])u"
      ["input"]=>
      string(23) "NBSP: | |, ZWSP: |​|"
      ["output"]=>
      string(20) "NBSP: |-|, ZWSP: |-|"
    }
    

    PHP String Literals

    You can use \u{XXXX} to define a codepoint. This only works in double-quoted strings. As you can see in the output, the pattern contains the actual Unicode characters in UTF-8 encoding, so the character class appears to contain only whitespace. The same escape would work for the input string, which could be written as "NBSP: |\u{00A0}|, ZWSP: |\u{200B}|".

    PCRE Codepoint Definition

    The second pattern uses the PCRE syntax for the codepoint: \x{XXXX}. The backslash should be escaped in PHP string literals (in a single-quoted string an unescaped \x falls back to a literal backslash followed by x, but it is always good to be explicit). You can see the codepoint definition in the pattern output.
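
    The difference between the two notations shows up in the length of the pattern string. The following sketch, with illustrative variable names and assuming PHP 7.0 or later for \u{...}, compares what PHP hands to PCRE in each case.

    ```php
    <?php
    // Double quotes: PHP itself expands \u{200B} into the three UTF-8 bytes.
    $phpEscape = "\u{200B}";

    // Single quotes: PHP leaves the text alone, so PCRE receives
    // the eight characters \x{200B} and interprets them itself.
    $pcreEscape = '\x{200B}';

    // In single quotes a doubled backslash is an escaped backslash, while a
    // backslash before x is kept as it is, so both spellings are the same string.
    $explicit = '\\x{200B}';

    var_dump(strlen($phpEscape));         // int(3)
    var_dump(strlen($pcreEscape));        // int(8)
    var_dump($pcreEscape === $explicit);  // bool(true)
    ```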

    Matching Bytes

    You can also match the bytes. In this case the regex is not in UTF-8 mode and the input string is treated as bytes. One implication is that you cannot use character classes for multi-byte characters, because a class only matches a single byte in this mode.

    $input    = "NBSP: |\xC2\xA0|, ZWSP: |\xE2\x80\x8B|";
    $patterns = [
        // matching as bytes
        '((?:\\xC2\\xA0|\\xE2\\x80\\x8B))',
    ];
    
    foreach ($patterns as $pattern) {
        var_dump(
            [
                'pattern' => $pattern,
                'input' => $input,
                'output' => preg_replace($pattern, '-', $input),
            ]
        );
    }
    

    Output:

    array(3) {
      ["pattern"]=>
      string(27) "((?:\xC2\xA0|\xE2\x80\x8B))"
      ["input"]=>
      string(23) "NBSP: | |, ZWSP: |​|"
      ["output"]=>
      string(20) "NBSP: |-|, ZWSP: |-|"
    }
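
    The caveat about character classes is easy to hit by accident. The sketch below uses an input chosen purely for illustration: "à" is C3 A0 in UTF-8, so a byte-mode class that lists \xA0 removes half of that character and leaves invalid UTF-8 behind, whereas matching the whole byte sequence only removes the NBSP.

    ```php
    <?php
    // "à" is encoded as C3 A0 and NBSP as C2 A0; they share the byte A0.
    $input = "\u{00E0}\u{00A0}";

    // Byte-mode character class: each listed byte matches on its own.
    $classResult = preg_replace('([\xC2\xA0])', '', $input);

    // Byte-mode sequence: only the complete two-byte NBSP matches.
    $sequenceResult = preg_replace('(\xC2\xA0)', '', $input);

    echo bin2hex($classResult), "\n";    // c3   (a lone lead byte, no longer valid UTF-8)
    echo bin2hex($sequenceResult), "\n"; // c3a0 ("à" is intact)
    ```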
    