skip to Main Content

I have a very annoying issue when transferring data from PHP (which is set up with default_charset = 'UTF-8') to MySQL, on an IIS environment with web.config set to:

<globalization requestEncoding="UTF-8" responseEncoding="UTF-8" />

Also tested on an Apache Environment with same default charset with same results. I also set MySQLi to set charset before any query:

$mysqli->set_charset('UTF-8');

I also have a set of prepared statements that search each character in a string to convert any none UTF-8 encoding to UTF-8 using:

mb_convert_encoding($char, 'UTF-8', mb_detect_encoding($char));

After trying and failing by parsing the whole string with the same process:

mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string));

The content type of the file with the script is set to content-type charset UTF-8 on the header. And still none of this has fixed the issue.

I also, have set all collations of the MySQL tables and schemas to utf8mb4, and of course using `utf8mb4_unicode_520_ci’. Still, issue persists.

The function I wrote to split the string is as follows:

    static private function split_and_convert($value, $encoding = 'UTF-8') {
        $split = mb_str_split($value);
        $filter = [];
        foreach($split as $chr) {
            $from = mb_detect_encoding($chr);
            $encoded = ($from !== $encoding)
                ? mb_convert_encoding($chr, $encoding, $from)
                : $chr;
            $filter[] = $encoded;
            // echoing for testing
            echo 'char ' . $encoded . ' (' . $from . ')<br>';
        }
        return join('', $filter);
    }

Which echo the following trying to save Martínez:

char M (ASCII)
char a (ASCII)
char r (ASCII)
char t (ASCII)
char � (UTF-8)
char n (ASCII)
char e (ASCII)
char z (ASCII)

With MySQL throwing the following error:

Error No: 1366 - Incorrect string value: 'xEDnez' for column 'contactLast' at row 1

Still no joy. But if I do this, it works like a charm:

    static private function split_and_convert($value, $encoding = 'UTF-8') {
        $split = mb_str_split($value);
        $filter = [];
        foreach($split as $chr) {
            $from = mb_detect_encoding($chr);
            $encoded = ($from !== $encoding)
                ? mb_convert_encoding($chr, $encoding, $from)
                // set to actually encode the 'UTF-8' encoded char
                // I'm supposed not to do this, but is what works
                // which makes no sense
                : utf8_encode($chr);
            $filter[] = $encoded;
            // echoing for testing
            echo 'char ' . $encoded . ' (' . $from . ')<br>';
        }
        return join('', $filter);
    }

With the following results:

char M (ASCII)
char a (ASCII)
char r (ASCII)
char t (ASCII)
char í (UTF-8)
char n (ASCII)
char e (ASCII)
char z (ASCII)

And MySQL not throwing any error. If I replace the utf8_encode with mb_convert_encoding($encoded, 'UTF-8'), I get the same errors as not filtering the string.

I’m suspecting mb_detect_encoding is either detecting erroneously the encoding of the characters or simply the mb_convert_encoding is not doing it’s job.

If anyone can help I will greatly appreciate it. I’m been searching for hours for a solution and none of the solutions offered have worked. And I find myself in a pickle because what works has been deprecated.

So, these are the list of the issues I’m facing:

  1. mb_convert_encoding -> not encoding or encoding to something other than UTF8
  2. mb_detect_encoding -> not detecting correctly
  3. Strings is encoded as ASCII not UTF-8 as charset in header requested.
  4. Confusion: Why I need to encode to UTF-8 an UTF-8 encoded character to work?

2

Answers


  1. Chosen as BEST ANSWER

    After looking for answers and long day of research this was the issue. Over processing. At some point during development a middleman was put in place to verify each POST/GET request and process each value to ensure the encoding sent was UTF-8. The code I'm refactoring already took care of processing the data.

    The middleman was coded using utf8_encode, so, I'm suspecting the function check itself to avoid over processing. I have to refactor the middleman as well so there's no conflict.

    So, that was the issue, two different codes were processing the encoding creating the conflict. As soon I refactor the middleman the issue desisted.

    In conclusion, the problem is not mb_convert_encoding or mb_detect_encoding, if you have a similar issue, make sure the code you are working on is not using two different methods doing similar processes with different functions.

    In my case, the middleman was not verifying or set to detect if the string was already encoded to UTF-8, it was just implementing the encoding regardless. And the reason I've been paid to refactor the code.

    Still what does not make sense to me is why a properly UTF-8 encoded string caused MySQL to throw an error. For that I suspect the over processing was corrupting the input.


  2. ED is the latin1 encoding for í. You should find what is encoding in latin1, then either change it to use UTF-8 (aka MySQL’s CHARACTER SET = utf8mb4) or tell MySQL that your data is in CHARACTER SET latin1 and let it convert as it stores (INSERTs) and retrieves (SELECTs) it.

    Do not use any conversion routines (mb_*), that tends to make the mess worse.

    The UTF-8 (utf8mb4) encoding for acute-i is C3AD.

    For more discussion, see "black diamond" in Trouble with UTF-8 characters; what I see is not what I stored

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search