skip to Main Content

I am experiencing some issue with MySQL encoding parsing data and then storing data from the Twitter API.

The tweet that struggles to get stored into the db is:

   INSERT INTO `statuses` (`status_id`,`text`) VALUES('93332222111111','The beers are on me in this case!�')

The character is this one.
whereas the following got stored successfully:

INSERT INTO `statuses` (`status_id`,`text`) VALUES('485072105225921','RT @someone: 🔥 Don't forget to index timestamp columns like "created_at" if you query against them.nne.g.: ORDER BY created_atne.g.: WH')

Let’s have a look into the character set:

SHOW VARIABLES LIKE 'character_set%'

which brings back

character_set_client    utf8mb4
character_set_connection    utf8mb4
character_set_database  utf8mb4
character_set_filesystem    binary
character_set_results   utf8mb4
character_set_server    utf8
character_set_system    utf8
character_sets_dir  /usr/local/Cellar/mysql/5.7.18_1/share/mysql/charsets/

Anything obvious that I am missing?

Update:
Also the following if block:

if utf8.ValidString(strings.Join(values, ",")) == false {
            fmt.Println(strings.Join(values, ","))
        }

returns:

'The beers are on me in this case!�','943304851980963841' 

2

Answers


  1. Two suggestions possible to solve your problem:

    • Use UTF16 charset;
    • Use utf8mb4 as char set and utf8mb4_unicode_ci as collation.

    You can use the following code as an example, extracted from an online tutorial:

    # For each database:
    ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
    # For each table:
    ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    # For each column:
    ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    

    As your problem is not with the database, you have to use the right code representing the image. I suggest you use “emoji-java”: a lightweight java library that helps you use Emojis in your java applications.

    An example:

    String str = "An :grinning:awesome :smiley:string 😄with a few :wink:emojis!";
    String result = EmojiParser.parseToUnicode(str);
    System.out.println(result);
    // Prints:
    // "An 😀awesome 😃string 😄with a few 😉emojis!"
    

    Another edit:
    You just tell now the language you are using: Go. In this case, you can take a look at Go-emoji here. Or even this another emoji project.

    Login or Signup to reply.
  2. Do not use utf16 for anything.

    Use MySQL’s CHARACTER SET utf8mb4; it is equivalent to the outside world’s UTF-8, and includes the characters that begin with hex F0. (MySQL’s utf8 does not include them.)

    🔥, hex F09F94A5 needs utf8mb4. Ditto for 🍻, hex F09F8DBB.

    indicates that something is already messed up. That is, the INSERT was given a bad character.

    Please provide SELECT HEX(text) ... to see if you get one of those hex values.

    Also see “black diamond” in Trouble with UTF-8 characters; what I see is not what I stored

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search