I’m having some trouble matching/replacing the ZWSP Unicode character encoded as UTF-8.
ZWSP: \x20\x0B
ZWSP (UTF-8): \xE2\x80\x8B
As an extra test case I have used NBSP (non-breaking space), which works as expected.
All preg_replace calls are in UTF-8 mode (/u).
- When matching NBSP it works as expected. The input is encoded as UTF-8 and the output is empty (the NBSP is replaced with an empty string).
- When matching ZWSP it only works if the ZWSP input is not UTF-8 encoded.
- If you change the ZWSP pattern to the UTF-8 encoded version and keep the input as UTF-8, it doesn’t work either.
Q: So how do I match ZWSP in UTF-8?
… or is this a bug?
code
$nbsp = '\xA0'; // Non-breaking space
$zwsp = '\x20\x0B'; // Zero-width space
$zwsp_utf8 = '\xE2\x80\x8B';
$input_nbsp_utf8 = "\xC2\xA0";
$input_zwsp = "\x20\x0B";
$input_zwsp_utf8 = "\xE2\x80\x8B";
// NBSP
echo "NBSP\n-----\n";
echo "in: $input_nbsp_utf8--\nhex: ".bin2hex($input_nbsp_utf8)."\n";
$output = preg_replace('/'.$nbsp.'/u', '', $input_nbsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";
// ZWSP (input: **not** UTF8)
echo "ZWSP (input: **not** UTF8)\n-----\n";
echo "in: $input_zwsp--\nhex: ".bin2hex($input_zwsp)."\n";
$output = preg_replace('/'.$zwsp.'/u', '', $input_zwsp);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";
// ZWSP (input: UTF8)
echo "ZWSP (input: UTF8)\n-----\n";
echo "in: $input_zwsp_utf8--\nhex: ".bin2hex($input_zwsp_utf8)."\n";
$output = preg_replace('/'.$zwsp.'/u', '', $input_zwsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";
// ZWSP (pattern: UTF8, input: UTF8)
echo "ZWSP (pattern: UTF8, input: UTF8)\n-----\n";
echo "in: $input_zwsp_utf8--\nhex: ".bin2hex($input_zwsp_utf8)."\n";
$output = preg_replace('/'.$zwsp_utf8.'/u', '', $input_zwsp_utf8);
echo "out: $output--\nhex: ".bin2hex($output)."\n\n";
Output
NBSP
-----
in: --
hex: c2a0
out: --
hex:
ZWSP (input: **not** UTF8)
-----
in:
--
hex: 200b
out: --
hex:
ZWSP (input: UTF8)
-----
in: --
hex: e2808b
out: --
hex: e2808b // Output should be empty
ZWSP (pattern: UTF8, input: UTF8)
-----
in: --
hex: e2808b
out: --
hex: e2808b // Output should be empty
2 Answers
Like many people, you seem to be confused about what UTF-8 is. UTF-8 isn’t a setting which is on or off, it is one of many different ways of turning text into binary data, and interpreting that binary data to get back the text.
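To make that concrete, here is a small sketch that writes one and the same code point, U+200B, in two different encodings. It assumes PHP 7.0 or later for the \u{...} string escape, and the variable names are purely illustrative.

```php
<?php
// One code point, U+200B (zero-width space), serialised two different ways.

// UTF-8: PHP 7+ expands \u{...} in double-quoted strings to UTF-8 bytes.
$utf8 = "\u{200B}";

// UTF-16BE: a single 16-bit big-endian code unit, built by hand with pack().
$utf16be = pack('n', 0x200B);

echo bin2hex($utf8), "\n";    // e2808b
echo bin2hex($utf16be), "\n"; // 200b
```

The second result is the same two bytes as "\x20\x0B", so that sequence is consistent with the UTF-16BE form of the character, which may be where it originated.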
I’m not sure where \x20\x0B came from, or what it has to do with anything, but saying something is "not UTF-8" is like saying a word is "not French", or a piece of meat is "not chicken".
Ignoring that part, let’s look at the key piece of code:
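Going by the question, that is presumably its last test case. Reduced to its essentials (a reconstruction from the question, not the answerer’s own listing), it looks like this:

```php
<?php
// Reconstructed from the question's final test case.
$zwsp_utf8 = '\xE2\x80\x8B';       // single quotes: PCRE receives the escapes as text
$input_zwsp_utf8 = "\xE2\x80\x8B"; // double quotes: PHP emits the three UTF-8 bytes

$output = preg_replace('/' . $zwsp_utf8 . '/u', '', $input_zwsp_utf8);

echo bin2hex($output), "\n"; // e2808b: nothing was replaced
```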
You have provided the /u modifier, which according to the manual makes PCRE treat both the pattern and the subject string as UTF-8.
Then you’ve matched using the \xhh notation, which is described under escape sequences as the character with hex code hh.
This is a bit confusing, but it means that normally \xE2 would match the binary byte E2, i.e. 11100010; but with /u active, it will instead match the Unicode code point U+00E2, which is "Latin Small Letter a With Circumflex".
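A short illustration of that difference follows. It is a sketch that assumes PHP 7.0 or later for the \u{...} string escape, and the variable names are illustrative.

```php
<?php
$a_circumflex = "\u{00E2}"; // U+00E2, stored as the UTF-8 bytes C3 A2
$zwsp = "\u{200B}";         // U+200B, stored as the UTF-8 bytes E2 80 8B

// With /u, \xE2 denotes the code point U+00E2, not the byte E2.
$with_u_letter = preg_match('/\xE2/u', $a_circumflex); // 1
$with_u_zwsp   = preg_match('/\xE2/u', $zwsp);         // 0

// Without /u, \xE2 denotes the single byte E2.
$no_u_letter = preg_match('/\xE2/', $a_circumflex);    // 0: the bytes are C3 A2
$no_u_zwsp   = preg_match('/\xE2/', $zwsp);            // 1: the first byte is E2

var_dump($with_u_letter, $with_u_zwsp, $no_u_letter, $no_u_zwsp);
```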
What it won’t match is the Unicode code point U+200B, "Zero-Width Space".
So, either treat your string as binary, don’t use the /u modifier, and look for the expected string of bytes (\xE2\x80\x8B); or treat your string as UTF-8, keep the /u modifier, and look for the code point you’re interested in (\x{200B}).
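Both options can be sketched as follows. The sample input and the variable names are illustrative, and the \u{...} string escape assumes PHP 7.0 or later.

```php
<?php
$input = "a\u{200B}b"; // "a", zero-width space, "b"; bytes 61 e2 80 8b 62

// Option 1: no /u, so \xE2\x80\x8B is matched as three raw bytes.
$bytes_out = preg_replace('/\xE2\x80\x8B/', '', $input);

// Option 2: /u, with the code point written in PCRE's \x{...} notation.
$codepoint_out = preg_replace('/\x{200B}/u', '', $input);

echo bin2hex($bytes_out), "\n";     // 6162
echo bin2hex($codepoint_out), "\n"; // 6162
```

Either way the result is the two-byte string "ab". The second form is usually easier to read, because the pattern names the character rather than its encoding.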
You can match a UTF-8 string using an ASCII-mode regex (one without /u); in this case you are matching the separate bytes. If you are using a regex in UTF-8 mode, the input has to be a valid UTF-8 string.
Matching Codepoints
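A minimal sketch of this approach, using an illustrative input string and two patterns chosen for the example: the first relies on PHP’s \u{...} string escape (PHP 7.0 or later), the second on PCRE’s \x{...} escape.

```php
<?php
$input = "NBSP: |\u{00A0}|, ZWSP: |\u{200B}|";

$patterns = [
    "([\u{00A0}\u{200B}])u",   // PHP inserts the characters themselves
    '([\\x{00A0}\\x{200B}])u', // PCRE receives the \x{...} escapes as text
];

$results = [];
foreach ($patterns as $pattern) {
    $results[] = preg_replace($pattern, '', $input);
    // Dump the pattern as PHP sees it, then the cleaned-up string.
    var_dump($pattern, end($results));
}
```

Both patterns yield "NBSP: ||, ZWSP: ||"; only the dumped pattern strings differ.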
PHP String Literals
You can use \u{XXXX} to define a codepoint. This only works in double-quoted strings. As you can see in the output, the pattern contains the actual Unicode characters in UTF-8 encoding, so the character class appears to contain nothing but some whitespace. This works for the input string as well, which could be written as "NBSP: |\u{00A0}|, ZWSP: |\u{200B}|".
PCRE Codepoint Definition
The second pattern uses PCRE syntax for the codepoint: \x{XXXX}. The \ should be escaped in PHP string literals (there is a fallback, but it is always good to be explicit). You can see the codepoint definition in the pattern output.
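How PHP’s quoting rules treat the two escapes can be checked directly. This is a small sketch with illustrative variable names, assuming PHP 7.0 or later:

```php
<?php
// PHP's own \u{...} escape is only expanded inside double quotes.
$php_double = "\u{200B}"; // three UTF-8 bytes
$php_single = '\u{200B}'; // eight literal ASCII characters

var_dump(bin2hex($php_double)); // string(6) "e2808b"
var_dump($php_single);          // string(8) "\u{200B}"

// PCRE's \x{...} escape has to reach the regex engine as text.
$explicit = '\\x{200B}'; // escaped backslash: unambiguous
$fallback = '\x{200B}';  // \x is no escape in single quotes, so the backslash survives

var_dump($explicit === $fallback); // bool(true)
var_dump(preg_replace('(' . $explicit . ')u', '', "a\u{200B}b")); // string(2) "ab"
```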
Matching Bytes
You can match the bytes. In this case the regex will not be in UTF-8 mode and the input string is treated as bytes. This has some implications: for example, you cannot use character classes for multi-byte characters, because they only match single bytes in this mode.
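A sketch of byte-wise matching, again with an illustrative input string (the \u{...} escape assumes PHP 7.0 or later). The second call shows why a character class is risky in this mode.

```php
<?php
$input = "NBSP: |\u{00A0}|, ZWSP: |\u{200B}|";

// No /u: the pattern is a plain alternation of byte sequences.
$output = preg_replace('(\\xC2\\xA0|\\xE2\\x80\\x8B)', '', $input);
var_dump($output); // string(18) "NBSP: ||, ZWSP: ||"

// A character class matches the bytes one at a time instead, so it also
// strips E2, 80 or 8B bytes that belong to some other character.
$broken = preg_replace('([\\xE2\\x80\\x8B])', '', "\u{201C}"); // U+201C is E2 80 9C
var_dump(bin2hex($broken)); // string(2) "9c", no longer valid UTF-8
```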