UTF8 null character & normalizing whitespace characters - Facebook api

comdex
April 21, 2017
247 views
1 vote
2 Answers

I’m working on a script that builds an XML feed using strings from the database. The strings are user-entered image captions from the Facebook Open Graph API. According to Facebook, the strings are supposed to be all UTF-8, so I import the captions into the database and store them as utf8_unicode_ci (I also tried utf8_bin).

But I always get the same error when trying to display the output XML feed, because one of the captions contains a weird whitespace character:

This page contains the following errors:

error on line 63466 at column 14: Input is not proper UTF-8, indicate encoding !
Bytes: 0x0B 0x54 0x68 0x6F
Below is a rendering of the page up to the first error.

In the database (phpMyAdmin) and in the page source (viewed in Chrome), the problematic character appears as an empty square symbol.
If I copy and paste the problematic character into a converter, it gives me hexadecimal 000B.

What’s the easiest way to fix this?
I’d also like to understand why the Facebook Graph API is giving me non-UTF-8 characters in the first place, when it’s not supposed to.

Failed attempts:

utf8_encode() doesn’t help because the rest of each string is already valid UTF-8.
I also tried several different ways of stripping out all non-UTF-8 characters, but none of them filters out this specific character. The same happens when I try to filter out all non-Latin characters.
htmlentities(), htmlspecialchars() and similar functions don’t encode the problematic character.
iconv(mb_detect_encoding()) does not detect the string as invalid UTF-8.
str_replace() and preg_replace() are of no help: if I try to copy and paste the character into Visual Studio Code, nothing is pasted, not even a whitespace.
str_replace(“”, “”, ) …nope

Answers

- Pyromonk
- April 22, 2017 at 2:45 am
- 0 votes
Here is a list of what we have found and/or worked through with the original poster:
We have checked the above and discovered that the initial problem was caused by vertical tabulation characters (0x0B) creeping into the text fields. A good way to remove them is by running $str = str_replace("\x0b", "", $str);, where $str is the string that is going to be inserted into the text field. It’s important not to replace \v, as that might not be desired (in a PCRE pattern, \v matches every vertical whitespace character, including line feeds).
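
The removal described above can be generalized a little. The following is only a minimal sketch, assuming PHP 7 or later: the function name strip_xml_illegal_controls and the sample caption are made up for illustration and do not come from the original poster’s code. It also shows why the UTF-8 checks in the question passed: the byte 0x0B is valid UTF-8, it is just not a character that XML 1.0 allows.

```php
<?php
// Hypothetical helper (not from the original thread): removes the control
// characters that XML 1.0 forbids, i.e. 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F.
// Tab (0x09), line feed (0x0A) and carriage return (0x0D) are kept.
// These are all single-byte ASCII values, which never occur inside a
// multibyte UTF-8 sequence, so the pattern is safe without the /u modifier.
function strip_xml_illegal_controls(string $str): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $str) ?? $str;
}

// Made-up sample caption containing a vertical tab (0x0B).
$caption = "Nice\x0BThought";

// preg_match with an empty /u pattern is a common way to test for valid
// UTF-8; it returns 1 here because 0x0B is a perfectly valid UTF-8 byte.
var_dump(preg_match('//u', $caption) === 1);

// The cleaned string no longer contains the vertical tab.
echo strip_xml_illegal_controls($caption), "\n";
```

Running the cleanup right before the caption is written into the feed, rather than when it is stored, would keep the database copy untouched; whether that is preferable depends on how the captions are used elsewhere.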

- RickJames
- April 23, 2017 at 6:45 pm
- 0 votes
If the 0B is always at the beginning of a string, then trace the strings back to their source and see if they are “BOM” encoded. See Wikipedia on BOM.

At least come back with the various steps the data takes, so we can help with deducing the source of the problem.

Note: although needed for Emoji and Chinese, switching to utf8mb4 will not deal with BOM if that is the ‘real’ problem.

(Using str_replace is just a band-aid.)
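
To trace the data through its steps, something like the following could be run on a caption at each stage (API response, before the database insert, after reading it back). This is a rough sketch assuming PHP 7 or later; both helper names are hypothetical. Note that the UTF-8 BOM is the byte sequence EF BB BF, so a BOM and a lone 0B would look different in the dump.

```php
<?php
// Hypothetical helper: true if the string starts with the UTF-8 BOM (EF BB BF).
function starts_with_utf8_bom(string $str): bool
{
    return strncmp($str, "\xEF\xBB\xBF", 3) === 0;
}

// Hypothetical helper: space-separated hex dump of the first $count bytes,
// useful for comparing the same caption at different steps.
function leading_bytes_hex(string $str, int $count = 4): string
{
    return implode(' ', str_split(bin2hex(substr($str, 0, $count)), 2));
}

// Sample built from the bytes shown in the error message (0x0B 0x54 0x68 0x6F).
$sample = "\x0BTho";

var_dump(starts_with_utf8_bom($sample)); // bool(false): 0x0B is not a BOM
echo leading_bytes_hex($sample), "\n";   // prints "0b 54 68 6f"
```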
