I have discovered that the reason I am getting weirdness with my strings being written to MySQL (5.6) from Kotlin is that, occasionally, they are being double encoded. I’ve verified this by printing the hex() of the column of two identical strings– here is an example where an API returned the identical string, with the identical jdbcurl parameters, identical everything, ten minutes apart:
Great job getting your protein and fruit to help balance your blood sugar 🍓🍇🍎
4772656174206A6F622067657474696E6720796F75722070726F7465696E20616E6420667275697420746F2068656C702062616C616E636520796F757220626C6F6F6420737567617220F09F8D93F09F8D87F09F8D8E
Great job getting your protein and fruit to help balance your blood sugar ðŸ“ðŸ‡ðŸŽ
4772656174206A6F622067657474696E6720796F75722070726F7465696E20616E6420667275697420746F2068656C702062616C616E636520796F757220626C6F6F6420737567617220C3B0C5B8C28DE2809CC3B0C5B8C28DE280A1C3B0C5B8C28DC5BD
For comparison, here’s the end of each string where the encoding differs:
F09F8D93F09F8D87F09F8D8E
C3B0C5B8C28DE2809CC3B0C5B8C28DE280A1C3B0C5B8C28DC5BD
Is this double encoding? It looks like it’s being encoded in utf-8 then decoded in cp1250 or latin1 then encoded again in utf-8. I mean, it seems like that’s the case but I can’t honestly figure out what else it could be.
I’m kind of at wits end here. I’ve read through all of @Rick James great info on this. I have my table columns properly set with utf8mb4
and my jdbcurl ends with:
&useUnicode=true&characterEncoding=UTF-8&sessionVariables=character_set_server=utf8mb4
The variables set on the MySQL server are:
character_set_client = utf8mb4
character_set_connection = utf8mb4
character_set_database = latin1
character_set_filesystem = binary
character_set_results = utf8mb4
character_set_server = latin1
character_set_system = utf8
I really have no idea why it works and then just… doesn’t. And generally when it doesn’t work, it’s like some setting gets changed and sticks because after one starts getting mangled, they all start getting mangled after wards.
This is a database that has multiple connections to it at once from different clients and code bases– so I’m wondering if there is something causing a double connection to double encode these, but I have no clue at this point. I’ve verified that the payload is handled properly in Kotlin but can’t pinpoint where the double encoding is happening, and why.
Hopefully somebody has some info on this because it’s driving me insane! I should mention our older PHP code base is just fine and has no issues reading/writing the same exact payload from the same exact API which makes me think it’s something in the Kotlin/Java jdbc driver potentially.
My original post from this weekend is here:
Issues writing utf8 strings with emojis from Kotlin to MySQL- utf8 vs utf8mb4
This lead me to check the byte structure.
EDIT:
I am adding this because it could explain the weird behavior I’m seeing. I am using the jdbc connector version 8 (8.0.25) from Kotlin, which according the documentation only officially supports MySQL 5.7 and 8.
https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-versions.html
We are rocking MySql 5.6– so I’m wondering if this could be causing my pain.
EDIT2:
SHOW CREATE TABLE output:
CREATE TABLE `tablename` (
`c1` int(11) unsigned NOT NULL AUTO_INCREMENT,
`c2` int(11) NOT NULL,
`c3` varchar(1000) NOT NULL DEFAULT '',
`c4` smallint(6) NOT NULL DEFAULT '3',
`c5` smallint(6) NOT NULL DEFAULT '0',
`created` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`modified` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=492331 DEFAULT CHARSET=utf8mb4
c3 is the column in question.
I know latin1 is the character_set_server
and that the jdbc connection says for mysql 5.6 you have to set that to utf8mb4 or it will not work– and I am setting that in the sessionVariables but is not only working sometimes? Does it need to be set on the server itself as the default with a restart of the database?
@RickJames the two dumps I have up above are from doing a HEX()
on two different rows of the same exact table, where I am attempting to write the exact same thing from Kotlin yet one works and the other seems to get double encoded.
2
Answers
Thanks again to @Rick James for his help. The thing that did it for my particular case, where we have JDBC connector 8+ trying to talk to a MySQL database 5.6.*, was that you have to set the database variable
character_set_server=utf8mb4
on the server itself, and restart the DB. Setting this as a sessionVariable in the jdbcurl has very unstable behavior, and the default of
latin1
kept rearing it's ugly head in completely random ways. So if you are having this issue, make sure that variable is set statically on your database if you can and it seems to have fixed it for me.In the jdbc litrature here: https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-charsets.html
it specifically says:
Yes, that looks like "double encoding".
See Trouble with UTF-8 characters; what I see is not what I stored
For cp1250, I would expect
🍓🍇🍎
For latin1, I would expect
ðŸ“ðŸ‡ðŸŽ
Please provide
SHOW CREATE TABLE
I hope you are not doing any explicit encode or decode function calls in the application. That confuses the issue.
Those Emoji need
utf8mb4
, notutf8
; all discussion here should apply to 5.6, 5.7, and 8.0 (and even 5.5).I hope there is a "8" on the end of what you have in your Question:
(I now notice that Kotlin was missing a dash: in
useUnicode=True&characterEncoding=utf8
cf https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-charsets.html )If
SELECT HEX(...)
is giving...F09F8D93...
, the table is ‘good’. (F0
, but itself is usually a sufficient indicator of correct utf8mb4 data.)If
SELECT HEX(...)
is giving...C3B0...
(any hex char Cx or Dx) may indicate Mojibake.