I am using
mysql Ver 15.1 Distrib 10.3.39-MariaDB, for Linux (x86_64) using readline 5.1
PHP 8.2.13 (cli) (built: Nov 24 2023 09:33:30) (NTS)
I would like to use either or both of these to find data within mysql where the mysql data columns have been declared to be in character set utf8mb4, collation utf8mb4_unicode_ci, but where the characters in the data are not actually valid UTF-8 characters.
I’m pretty sure that I have mojibake, latin1, or other invalid characters in my data. My data comes from governmental databases that do not really screen what reporters enter and do not use a consistent character encoding. Some of the characters certainly look like garbage.
However, I’ve tried various ways of automatically detecting these, and none of them seem to work.
I’ve tried within mysql (I’m pretty sure that all of the defaults e.g. SET NAMES are set correctly):
select field_name from table_name
WHERE CONVERT(field_name, binary) RLIKE '([\xC0-\xC1]|[\xF5-\xFF]|\xE0[\x80-\x9F]|\xF0[\x80-\x8F]|[\xC2-\xDF](?![\x80-\xBF])|[\xE0-\xEF](?![\x80-\xBF]{2})|[\xF0-\xF4](?![\x80-\xBF]{3})|(?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF]|(?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF]|(?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF])|(?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2})|(?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]))'
I’ve tried within PHP retrieving the mysql data into $string (I’m pretty sure that I’ve set all the PHP defaults to use UTF-8) and then:
1.
if (preg_match("//u", $string)) {
    // $string is valid UTF-8
}
2.
if (mb_detect_encoding($string, "UTF-8", TRUE)) {
    // $string is valid UTF-8
}
But none of these methods detect the characters that look wrong.
Sample data
DROP TABLE IF EXISTS sample;
CREATE TABLE sample ( data text DEFAULT NULL )
ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
INSERT INTO sample VALUES (
'Mr. Geoffrey H. Yost, O�Melveny & Myers LLP')
,('Valero Refining Company � California')
,('El Paso Merchant Energy � Petroleum Company')
,('Papé Group/IFCO')
,('Huntons� SureCrop Farm Svc Inc dba SureCrop Farm')
,('developed, manufactured, marketed, tested, and sold the electronic diesel control that allowed Mercedes to manipulate emissions controls and that Bosch marketed ¢€š¬Ã…¡¬ÃƒÆ’‚¬¦¡€š¬Ã…¡¬Ãƒ€ ‚¬„¢‚¬Å¡¬¦¢€š¬Ã…¡¬ÃƒÆ’‚¬¦€š¬Ã…€œClean Diesel¢€š¬Ã…¡¬ÃƒÆ’‚¬¦¡€š¬Ã…¡¬€š¬Ã…¡<9d> to the public.');
2 Answers
The original data was stored in utf8 or utf8mb4 but incorrectly decoded using an 8-bit encoding, like latin1.
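For example, the mechanism can be reproduced directly in MariaDB (a minimal sketch, assuming the connection character set is utf8mb4): UTF-8 bytes that get relabelled as latin1 turn into the familiar 'Ã©'-style garbage.

-- The literal 'é' sent over a utf8mb4 connection is the two bytes 0xC3 0xA9.
-- Relabelling those bytes as latin1 (no transcoding happens for binary input)
-- shows the classic mojibake result 'Ã©'.
SELECT CONVERT(BINARY 'é' USING latin1) AS mojibake;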
The '�' is the Unicode replacement character, which usually indicates that some other character could not be converted correctly at an earlier step. You can easily check this with e.g. Python.
Currently the utf8mb4 character set contains 1,112,064 valid characters, but an 8-bit character set only contains 255 characters. This means that your data could theoretically contain 1,111,809 multibyte characters that were not converted correctly.
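A quick check for rows that already contain the replacement character can also be done directly in MariaDB (a minimal sketch against the sample table from the question; the bytes 0xEF 0xBF 0xBD are the UTF-8 encoding of '�'):

-- Rows that already contain U+FFFD lost a character before the data was stored.
SELECT data
FROM sample
WHERE data LIKE CONCAT('%', CHAR(0xEF, 0xBF, 0xBD USING utf8mb4), '%');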
I loaded your sample data and then, based on this answer, re-converted the text and looked at what it produced.
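A sketch of that kind of statement (an assumption on my part; the usual latin1 round trip used to expose mojibake):

-- Convert the stored utf8mb4 text to latin1 bytes, then reinterpret those bytes
-- as utf8mb4. Double-encoded (mojibake) rows come out readable; rows that were
-- already correct UTF-8 (e.g. 'Papé') turn into invalid byte sequences instead.
SELECT data,
       CONVERT(BINARY CONVERT(data USING latin1) USING utf8mb4) AS reinterpreted
FROM sample;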
It seems that the import you did from the external databases incorrectly imported LATIN1 as UTF8.
Your statement "… governmental databases that do not really screen what reporters to the database enter …" might be incorrect; it may simply be that you imported the data using the wrong assumptions.
A solution to find and correct those errors might be an in-place re-conversion of the column, along the lines of the sketch below.
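As a sketch (my assumption of the statement meant here, using the sample table):

-- Rewrite each row by round-tripping it through latin1 back into utf8mb4.
UPDATE sample
SET data = CONVERT(BINARY CONVERT(data USING latin1) USING utf8mb4);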
But this leads to an error, because not every row survives that round trip as valid utf8mb4 (a character that is already correct, such as the 'é' in 'Papé', becomes the single latin1 byte 0xE9, which is not a valid UTF-8 sequence).
So, instead of correcting the error, you might want to restart and redo the import in a correct way?
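If the source files really are latin1/cp1252, declaring their true encoding on import lets MariaDB do the transcoding itself (a sketch; the file name, field delimiters, and column list are made up for illustration):

-- Tell MariaDB the real encoding of the file so it converts it to utf8mb4 itself.
LOAD DATA LOCAL INFILE 'reporters.csv'
INTO TABLE sample
CHARACTER SET latin1
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(data);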
Also try these two statements:
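The two statements are reconstructed here as a sketch (they may differ from the ones originally used): the first simply reinterprets the stored bytes as utf8mb4, the second converts the text to latin1 bytes first and then reinterprets those.

-- 1: reinterpret the stored bytes as utf8mb4; a utf8mb4 column already holds
--    valid UTF-8 byte sequences, so this should pass silently.
SELECT CONVERT(BINARY data USING utf8mb4) FROM sample;

-- 2: convert the text to latin1 bytes first, then reinterpret those bytes as
--    utf8mb4; byte runs that are not valid UTF-8 trigger a warning.
SELECT CONVERT(BINARY CONVERT(data USING latin1) USING utf8mb4) FROM sample;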
The first statement does not throw a warning, but the second statement does. The warning is:
Invalid utf8mb4 character string: 'A2809A'
The difference between those last two statements, and the warning given, might offer a way, although hardly a practical one, of finding these wrong characters.
I created a DBFIDDLE with the statements used.