skip to Main Content

I come back with utf8mb4 character issues. I’m using mariadb Server version: 10.3.37-MariaDB-0ubuntu0.20.04.1 Ubuntu 20.04 on a local machine with Linux Mint 20.3.
Here is a test table:

CREATE TABLE chartable (
    k1 VARCHAR(5)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

INSERT INTO chartable (k1) values
    ('食𣅀C' ),
    ('食B𡃀'),
    ('亰BC');

Now the issues:

MariaDB [test]> select * from chartable where k1 like '%𣅀%';
+----------+
| k1       |
+----------+
| 食𣅀C    |
| 食B𡃀    |
+----------+
2 rows in set (0,008 sec)
MariaDB [test]> select hex('𣅀');
+----------+
| hex('?') |
+----------+
| F0A38580 |
+----------+

1 row in set (0,001 sec)

I notice the first query doesn’t return what I expected, which is 食𣅀C
Second query displays a ? instead of 𣅀
I ensured that character set and collations are consistent, database and table have the same character set and collation, utf8mb4_general_ci as I was told in this thread https://stackoverflow.com/questions/74975518/false-duplicate-with-mariadb-and-chinese-characters/74975829#74975829

MariaDB [nihongo]> show variables like 'character%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

and:

MariaDB [nihongo]> show variables like 'collation%';
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8mb4_general_ci |
| collation_database   | utf8mb4_general_ci |
| collation_server     | utf8mb4_general_ci |
+----------------------+--------------------+

I’ve tested with others tables and got the same type of issues.
Could you explain me what I’m doing wrong and what to do to use correctly LIKE with mariadb and chinese characters? Thank you in advance

2

Answers


  1. Chosen as BEST ANSWER

    Thank you for the response. The problem is I've changed from utf8mb4_unicode_ci to utf8mb4_general_ci because similar issues occured with utf8mb4_unicode_ci, as I was told in this thread.

    Let's add the two following characters in the chartable:

    INSERT INTO chartable VALUES ('⺌'), ('⺍');
    

    Then I test SELECT with the two collations

    MariaDB [test]> select * from chartable  where k1='⺌' collate utf8mb4_unicode_ci;
    +------+
    | k1   |
    +------+
    | ⺌   |
    | ⺍   |
    +------+
    
    MariaDB [test]> select * from chartable  where k1='⺌' collate utf8mb4_general_ci;
    +------+
    | k1   |
    +------+
    | ⺌   |
    +------+
    

    Here I notice the right answer comes with utf8mb4_general_ci collation, while utf8mb4_unicode_ci doesn't return the right answer. The two collations seem each to be appropriate for some unicode characters, and not for others. How to manage this?


  2. utf8mb4_general_ci is a simplified faster sort/comparison implementation . It does not follow the Unicode rules and might result in undesirable sorting or comparison.

    Instead of you should use utf8mb4_unicode_ci, which is based on the official unicode rules for ,sorting and comparison.

    select * from chartable where k1 like '%𣅀%' collate utf8mb4_unicode_ci;
    

    hex('%𣅀%') is displayed as hex('?') since the use of unicode characters in identifier names is limited to characters < +U10000 (see also https://mariadb.com/kb/en/identifier-names/).

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search