skip to Main Content

I am using

mysql Ver 15.1 Distrib 10.3.39-MariaDB, for Linux (x86_64) using readline 5.1

PHP 8.2.13 (cli) (built: Nov 24 2023 09:33:30) (NTS)

I would like to use either or both of these to find data within mysql where the mysql data columns have been declared to be in character set utf8mb4, collation utf8mb4_unicode_ci, but where the characters in the data are not actually valid UTF-8 characters.

I’m pretty sure that I have Mojobake, latin1, or other non-valid characters in my data. My data comes from governmental databases that do not really screen what reporters to the database enter and do not have a consistent character encoding. Some of the characters certainly look like garbage characters.

However, I’ve tried various ways of automatically detecting these, and none of them seem to work.

I’ve tried within mysql (I’m pretty sure that all of the defaults e.g. SET NAMES are set correctly):

select field_name from table_name
WHERE CONVERT(field_name, binary) RLIKE '([\xC0-\xC1]|[\xF5-\xFF]|\xE0[\x80-\x9F]|\xF0[\x80-\x8F]|[\xC2-\xDF](?![\x80-\xBF])|[\xE0-\xEF](?![\x80-\xBF]{2})|[\xF0-\xF4](?![\x80-\xBF]{3})|(?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF]|(?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF]|(?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF])|(?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2})|(?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]))'

I’ve tried within PHP retrieving the mysql data into $string (I’m pretty sure that I’ve set all the PHP defaults to use UTF-8) and then:

1.

if (preg_match("//u", $string)) {
  // $string is valid UTF-8
}
if (mb_detect_encoding($string,"UTF-8",TRUE)) {
  // $string is valid UTF-8
}

But none of these methods detect the characters that look wrong.

Sample data

 DROP TABLE IF EXISTS sample; 
 CREATE TABLE sample ( data text DEFAULT NULL ) 
 ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci; 
 INSERT INTO sample VALUES (
 'Mr. Geoffrey H. Yost, O�Melveny & Myers LLP')
 ,('Valero Refining Company � California')
 ,('El Paso Merchant Energy � Petroleum Company')
 ,('Papé Group/IFCO')
 ,('Huntons� SureCrop Farm Svc Inc dba SureCrop Farm')
 ,('developed, manufactured, marketed, tested, and sold the electronic diesel control that allowed Mercedes to manipulate emissions controls and that Bosch marketed ¢€š¬Ã…¡¬ÃƒÆ’‚¬¦¡€š¬Ã…¡¬Ãƒ€ ‚¬„¢‚¬Å¡¬¦¢€š¬Ã…¡¬ÃƒÆ’‚¬¦€š¬Ã…€œClean Diesel¢€š¬Ã…¡¬ÃƒÆ’‚¬¦¡€š¬Ã…¡¬€š¬Ã…¡<9d> to the public.'); 

2

Answers


  1. The original data was stored in utf8 or utf8mb4 but incorrectly decoded using an 8-bit encoding, like latin1.

    '�' was unicode replacement character which usually indicates that some other character could not correctly be converted before.

    You can easily check this with e.g. python

    >>> 'é'.encode('raw_unicode_escape').decode('utf8')
    'é'
    >>> '�'.encode('raw_unicode_escape').decode('utf8')
    '�'
    

    Currently the utf8mb4 character set contains 1,112,064 valid characters, but an 8-bit character set only contains 255 characters. This means that your data theoretically might have 1.111.809 wrong multibyte characters, which weren’t converted correctly.

    Login or Signup to reply.
  2. I did load your sample data, after that I did, base on this answer:

    -- Because I do not like selecting text with weird characters:
    alter table sample add id INT AUTO_INCREMENT PRIMARY KEY;
    
    select 
       id, 
       data, 
       convert(cast(convert(data using  latin1) as binary) using utf8mb4) L 
    from sample 
    where id=4;
    

    This produces:

    id data L
    4 Papé Group/IFCO Papé Group/IFCO

    It seems that the import you did from the external databases incorrectly imported LATIN1 as UTF8.

    Your statement: "… governmental databases that do not really screen what reporters to the database enter …" might be incorrect, and it might be that you simply imported it using wrong assumtpions.

    A solution, to find those errors, might be:

    select 
      data d1,
      convert(cast(convert(data using  latin1) as binary) using utf8mb4) d2 
    from sample 
    where data<>convert(cast(convert(data using  latin1) as binary) using utf8mb4);
    

    But this leads to the error:

    ERROR 1267 (HY000): Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_0900_ai_ci,IMPLICIT) for operation ‘<>’

    So, in stead of correcting the error, you might want to restart, and do the import in a correct way?

    Also try this:

    select id, data, convert(cast(convert(data using  latin1) as binary) using utf8mb4) as d2 
    from sample 
    where data collate utf8mb4_0900_ai_ci != convert(cast(convert(data using  latin1) as binary) using utf8mb4) and id between 5 and 5;
    
    select id, data, convert(cast(convert(data using  latin1) as binary) using utf8mb4) as d2 
    from sample 
    where data collate utf8mb4_0900_ai_ci != convert(cast(convert(data using  latin1) as binary) using utf8mb4) and id >= 5;
    

    The first statement does not throw a warning, but the second statement does. The warning is: Invalid utf8mb4 character string: 'A2809A'

    The difference between those 2 last statement, and the warning give, might lead to an almost not possible way of finding this wrong characters.

    I created a DBFIDDLE with the statements used.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search