I’m trying to write an query that will analyze the data stored in Primary Keys across multiple tables to find if they contain unicode characters in them.
Below are the schema of my tables:
mysql> SHOW CREATE TABLE employee_plain;
+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| employee_plain | CREATE TABLE `employee_plain` (
`emp_id` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
`emp_name` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`age` int(3) DEFAULT NULL,
PRIMARY KEY (`emp_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> SHOW CREATE TABLE employee_unicode;
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| employee_unicode | CREATE TABLE `employee_unicode` (
`emp_id` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
`emp_name` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`age` int(3) DEFAULT NULL,
PRIMARY KEY (`emp_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Below is the data store on both table. The table employee_unicode
contains unicode values in the PRIMARY KEY column:
mysql> select * from employee_plain;
+------------------------------+----------+------+
| emp_id | emp_name | age |
+------------------------------+----------+------+
| asdasd123 | abcsd | 12 |
| fsoiuioujvsdf4 | abvkd | 13 |
| sdfgjshgjshdfljsfklju4532489 | sdfsdff | 11 |
+------------------------------+----------+------+
3 rows in set (0.00 sec)
mysql> select * from employee_unicode;
+--------------------------------------------------------------+----------+------+
| emp_id | emp_name | age |
+--------------------------------------------------------------+----------+------+
| A ΠΛΦΟΙΚ ΑΕ#1420000000000000000 | sdfsf | 11 |
| sdfsdfsf234 | fsdfsd | 12 |
| ΑΣΕΛ - ΑΦΟΙ. ΣΕΛΙΔΗ Α.Ε.#000000000000000 | sdfsd | 13 |
| ΦΩΤΗΣ#10000000000 | sdfsdfd | 14 |
+--------------------------------------------------------------+----------+------+
4 rows in set (0.00 sec)
I tried various queries using ASCII, BINARY and REGEX:
mysql> SELECT
-> TABLE_NAME,
-> COLUMN_NAME,
-> COLUMN_TYPE,
-> IF( COLUMN_NAME REGEXP '[^x00-x7F]', 'Contains Unicode', 'No Unicode') AS Unicode_validation
-> FROM
-> information_schema.columns
-> WHERE
-> table_schema = 'amv_testdb' AND
-> COLUMN_KEY = 'PRI'
-> ORDER BY
-> TABLE_NAME,
-> ORDINAL_POSITION;
+------------------+-------------+--------------+--------------------+
| TABLE_NAME | COLUMN_NAME | COLUMN_TYPE | Unicode_validation |
+------------------+-------------+--------------+--------------------+
| employee_plain | emp_id | varchar(100) | No Unicode |
| employee_unicode | emp_id | varchar(100) | No Unicode |
+------------------+-------------+--------------+--------------------+
2 rows in set (0.00 sec)
mysql> SELECT
-> TABLE_NAME,
-> COLUMN_NAME,
-> COLUMN_TYPE,
-> IF( COLUMN_NAME <> CONVERT( COLUMN_NAME USING ASCII), 'No Unicode', 'Contains Unicode') AS Unicode_validation
-> FROM
-> information_schema.columns
-> WHERE
-> table_schema = 'amv_testdb' AND
-> COLUMN_KEY = 'PRI'
-> ORDER BY
-> TABLE_NAME,
-> ORDINAL_POSITION;
+------------------+-------------+--------------+--------------------+
| TABLE_NAME | COLUMN_NAME | COLUMN_TYPE | Unicode_validation |
+------------------+-------------+--------------+--------------------+
| employee_plain | emp_id | varchar(100) | Contains Unicode |
| employee_unicode | emp_id | varchar(100) | Contains Unicode |
+------------------+-------------+--------------+--------------------+
2 rows in set (0.00 sec)
mysql> SELECT
-> TABLE_NAME,
-> COLUMN_NAME,
-> COLUMN_TYPE,
-> IF(CONVERT(COLUMN_NAME USING BINARY) <> COLUMN_NAME, 'Contains Unicode', 'No Unicode') AS Unicode_validation
-> FROM
-> information_schema.columns
-> WHERE
-> table_schema = 'amv_testdb' AND
-> COLUMN_KEY = 'PRI'
-> ORDER BY
-> TABLE_NAME,
-> ORDINAL_POSITION;
+------------------+-------------+--------------+--------------------+
| TABLE_NAME | COLUMN_NAME | COLUMN_TYPE | Unicode_validation |
+------------------+-------------+--------------+--------------------+
| employee_plain | emp_id | varchar(100) | No Unicode |
| employee_unicode | emp_id | varchar(100) | No Unicode |
+------------------+-------------+--------------+--------------------+
2 rows in set (0.00 sec)
Please help me understand why I get incorrect results.
2
Answers
If looking for data that contains non-ASCII characters, then compare the stored value to a conversion of that value using ASCII e.g.
WHERE emp_id <> CONVERT(emp_id USING ASCII)
fiddle
will spot any
emp_id
that has at least one UTF-8 character. (That is,CHARACTER SET utf8
orutf8mb4
.)