I have discovered that the reason I am getting weirdness with my strings being written to MySQL (5.6) from Kotlin is that, occasionally, they are being double encoded. I have verified this by printing the HEX() of the column for two writes of the identical string. Here is an example where an API returned the identical string, with identical JDBC URL parameters, identical everything, ten minutes apart:

Great job getting your protein and fruit to help balance your blood sugar 🍓🍇🍎

4772656174206A6F622067657474696E6720796F75722070726F7465696E20616E6420667275697420746F2068656C702062616C616E636520796F757220626C6F6F6420737567617220F09F8D93F09F8D87F09F8D8E

Great job getting your protein and fruit to help balance your blood sugar ðŸ“ðŸ‡ðŸŽ

4772656174206A6F622067657474696E6720796F75722070726F7465696E20616E6420667275697420746F2068656C702062616C616E636520796F757220626C6F6F6420737567617220C3B0C5B8C28DE2809CC3B0C5B8C28DE280A1C3B0C5B8C28DC5BD

For comparison, here’s the end of each string where the encoding differs:

F09F8D93F09F8D87F09F8D8E
C3B0C5B8C28DE2809CC3B0C5B8C28DE280A1C3B0C5B8C28DC5BD

Is this double encoding? It looks like the string is being encoded in UTF-8, decoded as cp1250 or latin1, then encoded again in UTF-8. That seems to be what is happening, but I honestly can't figure out what else it could be.
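
To sanity-check that theory, here is a minimal Kotlin round trip (just an illustration, not our production code) that produces that kind of byte pattern: encode to UTF-8, decode those bytes as Windows-1252 (which is essentially what MySQL calls latin1), then encode to UTF-8 again:

import java.nio.charset.Charset

fun main() {
    val original = "🍓🍇🍎"

    // The correct UTF-8 bytes, as they should be stored.
    val utf8 = original.toByteArray(Charsets.UTF_8)
    println(utf8.joinToString("") { "%02X".format(it) })
    // prints F09F8D93F09F8D87F09F8D8E

    // The same bytes misread as Windows-1252 (MySQL's "latin1")
    // and then encoded to UTF-8 a second time.
    val doubled = String(utf8, Charset.forName("windows-1252"))
        .toByteArray(Charsets.UTF_8)
    println(doubled.joinToString("") { "%02X".format(it) })
    // should print the mangled tail shown above: C3B0C5B8C28DE2809C...
}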

I’m kind of at my wits’ end here. I’ve read through all of @Rick James’s great info on this. My table columns are properly set to utf8mb4 and my JDBC URL ends with:

&useUnicode=true&characterEncoding=UTF-8&sessionVariables=character_set_server=utf8mb4
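
For context, the connection is opened and written to roughly like this; the host, database name, and credentials below are placeholders, not our real values:

import java.sql.DriverManager

// Hypothetical host, database, and credentials -- illustration only.
val url = "jdbc:mysql://db.example.com:3306/mydb" +
    "?useUnicode=true&characterEncoding=UTF-8" +
    "&sessionVariables=character_set_server=utf8mb4"

fun writeMessage(text: String) {
    DriverManager.getConnection(url, "appuser", "secret").use { conn ->
        conn.prepareStatement("INSERT INTO tablename (c2, c3) VALUES (?, ?)").use { stmt ->
            stmt.setInt(1, 1)          // c2 is NOT NULL with no default
            stmt.setString(2, text)    // c3 is the varchar(1000) column in question
            stmt.executeUpdate()
        }
    }
}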

The variables set on the MySQL server are:

character_set_client = utf8mb4
character_set_connection = utf8mb4
character_set_database = latin1
character_set_filesystem = binary
character_set_results = utf8mb4
character_set_server = latin1
character_set_system = utf8
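
Those are the server-wide defaults; what matters for each write is what the individual session negotiates, which can be dumped from the writing connection itself. A quick sketch:

import java.sql.Connection

// Session-level values negotiated by this particular connection
// (these can differ from the server defaults listed above).
fun dumpSessionCharsets(conn: Connection) {
    conn.createStatement().use { stmt ->
        stmt.executeQuery("SHOW SESSION VARIABLES LIKE 'character_set_%'").use { rs ->
            while (rs.next()) {
                println("${rs.getString(1)} = ${rs.getString(2)}")
            }
        }
    }
}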

I really have no idea why it works and then just… doesn’t. And generally, when it stops working, it’s as if some setting gets changed and sticks: once one string gets mangled, they all start getting mangled afterwards.

This database has multiple simultaneous connections from different clients and code bases, so I’m wondering if something about having multiple connections is causing these to be double encoded, but I have no clue at this point. I’ve verified that the payload is handled properly in Kotlin, but I can’t pinpoint where the double encoding is happening, or why.

Hopefully somebody has some info on this because it’s driving me insane! I should mention that our older PHP code base is just fine and has no issues reading/writing the exact same payload from the exact same API, which makes me think it’s potentially something in the Kotlin/Java JDBC driver.

My original post from this weekend is here:
Issues writing utf8 strings with emojis from Kotlin to MySQL- utf8 vs utf8mb4

This led me to check the byte structure.

EDIT:

I am adding this because it could explain the weird behavior I’m seeing. I am using JDBC Connector/J version 8 (8.0.25) from Kotlin, which, according to the documentation, only officially supports MySQL 5.7 and 8.

https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-versions.html

We are rocking MySQL 5.6, so I’m wondering if this could be causing my pain.

EDIT2:

SHOW CREATE TABLE output:

CREATE TABLE `tablename` (
  `c1` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `c2` int(11) NOT NULL,
  `c3` varchar(1000) NOT NULL DEFAULT '',
  `c4` smallint(6) NOT NULL DEFAULT '3',
  `c5` smallint(6) NOT NULL DEFAULT '0',
  `created` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `modified` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`c1`)
) ENGINE=InnoDB AUTO_INCREMENT=492331 DEFAULT CHARSET=utf8mb4

c3 is the column in question.

I know character_set_server is latin1, and the JDBC documentation says that for MySQL 5.6 you have to set it to utf8mb4 or it will not work. I am setting that in sessionVariables, so why is it only working sometimes? Does it need to be set on the server itself as the default, with a restart of the database?

@RickJames the two dumps I have up above are from doing a HEX() on two different rows of the same exact table, where I am attempting to write the exact same thing from Kotlin yet one works and the other seems to get double encoded.
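
For reference, the comparison is roughly this (the row ids passed in are hypothetical):

// Pull the stored string and its HEX() for two rows and compare,
// as in the two dumps above.
fun compareRows(conn: java.sql.Connection, goodId: Int, badId: Int) {
    conn.prepareStatement("SELECT c1, c3, HEX(c3) FROM tablename WHERE c1 IN (?, ?)").use { stmt ->
        stmt.setInt(1, goodId)
        stmt.setInt(2, badId)
        stmt.executeQuery().use { rs ->
            while (rs.next()) {
                println("row ${rs.getInt(1)}: ${rs.getString(2)}")
                println("  hex: ${rs.getString(3)}")
            }
        }
    }
}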

2 Answers


  1. Chosen as BEST ANSWER

    Thanks again to @Rick James for his help. The thing that did it in my particular case, where we have JDBC Connector/J 8+ talking to a MySQL 5.6.* database, was that you have to set the database variable

    character_set_server=utf8mb4

    on the server itself, and restart the DB. Setting this as a sessionVariable in the JDBC URL has very unstable behavior, and the default of latin1 kept rearing its ugly head in completely random ways. So if you are having this issue, make sure that variable is set statically on your database if you can; that seems to have fixed it for me.

    In the JDBC documentation here: https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-charsets.html

    it specifically says:

    For Connector/J 8.0.12 and earlier: In order to use the utf8mb4 character set for the connection, the server MUST be configured with character_set_server=utf8mb4; if that is not the case, when UTF-8 is used for characterEncoding in the connection string, it will map to the MySQL character set name utf8, which is an alias for utf8mb3.


  2. Yes, that looks like "double encoding".

    See Trouble with UTF-8 characters; what I see is not what I stored

    For cp1250, I would expect đźŤ“đźŤ‡đźŤŽ

    For latin1, I would expect ðŸ“ðŸ‡ðŸŽ

    Please provide SHOW CREATE TABLE

    I hope you are not doing any explicit encode or decode function calls in the application. That confuses the issue.

    Those Emoji need utf8mb4, not utf8; all discussion here should apply to 5.6, 5.7, and 8.0 (and even 5.5).

    I hope there is an "8" on the end of what you have in your Question:

    &useUnicode=true&characterEncoding=UTF-
    

    (I now notice that the Kotlin connection string was missing a dash: useUnicode=True&characterEncoding=utf8; cf. https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-charsets.html )

    If SELECT HEX(...) is giving ...F09F8D93..., the table is ‘good’. (F0, by itself, is usually a sufficient indicator of correct utf8mb4 data.)

    If SELECT HEX(...) is giving ...C3B0... (any hex byte Cx or Dx), that may indicate Mojibake.
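
    A rough sketch of that rule of thumb, applied to a hex string taken straight from SELECT HEX(c3):

    // Sketch only: a double-encoded emoji shows up in HEX() output as
    // C3B0C5B8... (UTF-8 for "ðŸ"), whereas correctly stored utf8mb4
    // emoji start with F09F.
    fun looksDoubleEncoded(hexFromMySql: String): Boolean =
        hexFromMySql.contains("C3B0C5B8", ignoreCase = true)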
