skip to Main Content

I’m trying to write an query that will analyze the data stored in Primary Keys across multiple tables to find if they contain unicode characters in them.

Below are the schema of my tables:

 mysql> SHOW CREATE TABLE employee_plain;
+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table          | Create Table                                                                                                                                                                                                                                                                |
+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| employee_plain | CREATE TABLE `employee_plain` (
  `emp_id` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  `emp_name` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
  `age` int(3) DEFAULT NULL,
  PRIMARY KEY (`emp_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> SHOW CREATE TABLE employee_unicode;
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table            | Create Table                                                                                                                                                                                                                                                                  |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| employee_unicode | CREATE TABLE `employee_unicode` (
  `emp_id` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  `emp_name` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
  `age` int(3) DEFAULT NULL,
  PRIMARY KEY (`emp_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Below is the data store on both table. The table employee_unicode contains unicode values in the PRIMARY KEY column:

mysql> select * from employee_plain;
+------------------------------+----------+------+
| emp_id                       | emp_name | age  |
+------------------------------+----------+------+
| asdasd123                    | abcsd    |   12 |
| fsoiuioujvsdf4               | abvkd    |   13 |
| sdfgjshgjshdfljsfklju4532489 | sdfsdff  |   11 |
+------------------------------+----------+------+
3 rows in set (0.00 sec)

mysql> select * from employee_unicode;
+--------------------------------------------------------------+----------+------+
| emp_id                                                       | emp_name | age  |
+--------------------------------------------------------------+----------+------+
| A ΠΛΦΟΙΚ ΑΕ#1420000000000000000                              | sdfsf    |   11 |
| sdfsdfsf234                                                  | fsdfsd   |   12 |
| ΑΣΕΛ - ΑΦΟΙ. ΣΕΛΙΔΗ Α.Ε.#000000000000000                     | sdfsd    |   13 |
| ΦΩΤΗΣ#10000000000                                            | sdfsdfd  |   14 |
+--------------------------------------------------------------+----------+------+
4 rows in set (0.00 sec)

I tried various queries using ASCII, BINARY and REGEX:

mysql> SELECT
    ->   TABLE_NAME,
    ->   COLUMN_NAME,
    ->   COLUMN_TYPE,
    ->   IF( COLUMN_NAME REGEXP '[^x00-x7F]', 'Contains Unicode', 'No Unicode') AS Unicode_validation
    -> FROM
    ->   information_schema.columns
    -> WHERE
    ->   table_schema = 'amv_testdb' AND
    ->   COLUMN_KEY = 'PRI'
    -> ORDER BY
    ->   TABLE_NAME,
    ->   ORDINAL_POSITION;
+------------------+-------------+--------------+--------------------+
| TABLE_NAME       | COLUMN_NAME | COLUMN_TYPE  | Unicode_validation |
+------------------+-------------+--------------+--------------------+
| employee_plain   | emp_id      | varchar(100) | No Unicode         |
| employee_unicode | emp_id      | varchar(100) | No Unicode         |
+------------------+-------------+--------------+--------------------+
2 rows in set (0.00 sec)

mysql> SELECT
    ->   TABLE_NAME,
    ->   COLUMN_NAME,
    ->   COLUMN_TYPE,
    ->   IF( COLUMN_NAME <> CONVERT( COLUMN_NAME USING ASCII), 'No Unicode', 'Contains Unicode') AS Unicode_validation
    -> FROM
    ->   information_schema.columns
    -> WHERE
    ->   table_schema = 'amv_testdb' AND
    ->   COLUMN_KEY = 'PRI'
    -> ORDER BY
    ->   TABLE_NAME,
    ->   ORDINAL_POSITION;
+------------------+-------------+--------------+--------------------+
| TABLE_NAME       | COLUMN_NAME | COLUMN_TYPE  | Unicode_validation |
+------------------+-------------+--------------+--------------------+
| employee_plain   | emp_id      | varchar(100) | Contains Unicode   |
| employee_unicode | emp_id      | varchar(100) | Contains Unicode   |
+------------------+-------------+--------------+--------------------+
2 rows in set (0.00 sec)

mysql> SELECT
    ->   TABLE_NAME,
    ->   COLUMN_NAME,
    ->   COLUMN_TYPE,
    ->   IF(CONVERT(COLUMN_NAME USING BINARY) <> COLUMN_NAME, 'Contains Unicode', 'No Unicode') AS Unicode_validation
    -> FROM
    ->   information_schema.columns
    -> WHERE
    ->   table_schema = 'amv_testdb' AND
    ->   COLUMN_KEY = 'PRI'
    -> ORDER BY
    ->   TABLE_NAME,
    ->   ORDINAL_POSITION;
+------------------+-------------+--------------+--------------------+
| TABLE_NAME       | COLUMN_NAME | COLUMN_TYPE  | Unicode_validation |
+------------------+-------------+--------------+--------------------+
| employee_plain   | emp_id      | varchar(100) | No Unicode         |
| employee_unicode | emp_id      | varchar(100) | No Unicode         |
+------------------+-------------+--------------+--------------------+
2 rows in set (0.00 sec)

Please help me understand why I get incorrect results.

2

Answers


  1. If looking for data that contains non-ASCII characters, then compare the stored value to a conversion of that value using ASCII e.g. WHERE emp_id <> CONVERT(emp_id USING ASCII)

    CREATE TABLE employees (
        emp_id VARCHAR(255),
        emp_name VARCHAR(255),
        age INT
    );
    
    INSERT INTO employees (emp_id, emp_name, age)
    VALUES
        ('A ΠΛΦΟΙΚ ΑΕ#1420000000000000000', 'sdfsf', 11),
        ('sdfsdfsf234', 'fsdfsd', 12),
        ('ΑΣΕΛ - ΑΦΟΙ. ΣΕΛΙΔΗ Α.Ε.#000000000000000', 'sdfsd', 13),
        ('ΦΩΤΗΣ#10000000000', 'sdfsdfd', 14);
    

    SELECT *
    FROM employees
    WHERE emp_id <> CONVERT(emp_id USING ASCII)
    
    emp_id emp_name age
    A ΠΛΦΟΙΚ ΑΕ#1420000000000000000 sdfsf 11
    ΑΣΕΛ – ΑΦΟΙ. ΣΕΛΙΔΗ Α.Ε.#000000000000000 sdfsd 13
    ΦΩΤΗΣ#10000000000 sdfsdfd 14

    fiddle

    Login or Signup to reply.
  2. WHERE HEX(emp_id) REGEXP '^(..)*[CDEFcdef]'
    

    will spot any emp_id that has at least one UTF-8 character. (That is, CHARACTER SET utf8 or utf8mb4.)

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search