I have a MySQL table with ~400 columns.
The columns are accumulated from external data, so I don't really have control over the count, initially at least.
In the interest of avoiding errors whilst creating the columns, they are all just the TEXT datatype, and the table now has all of the rows it will likely ever need.
I am now trying to optimise the data. I have an idea of how to do this, but again, I want to avoid errors if possible.
TL;DR: I would like to iterate over each column and determine its appropriate datatype (i.e. TINYINT, INT, VARCHAR(?), CHAR(?), BOOLEAN, TEXT, etc.).
I understand how to loop over the data, but I am more concerned about accuracy and ensuring data integrity, so what I am asking is whether anyone knows of any resources or links to existing code that performs this exact function.
I don't want to miss any nuances or special cases for any of these data types, and it just makes sense not to reinvent the wheel if a resource already exists.
4 Answers
After much browsing, and investigating Tableschema, it seemed that the fastest way to solve the problem would be to iterate through the data myself.
This is only a basic overview, as I can't provide the actual code, but it should be enough to save someone a lot of time in future; also make sure to zero out or null everything before starting the loop for each column.
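This is not the poster's actual code, but as a rough sketch of the kind of per-column aggregate such a loop might compute (my_table and col1 are placeholder names, and the pattern counts only cover non-NULL values):

```sql
-- Illustrative per-column probe only; run once for each column.
-- my_table and col1 are placeholders.
SELECT
    COUNT(*)                               AS total_rows,
    SUM(col1 IS NULL OR col1 = '')         AS empty_rows,
    MAX(CHAR_LENGTH(col1))                 AS max_len,
    SUM(col1 REGEXP '^-?[0-9]+$')          AS integer_like,
    SUM(col1 REGEXP '^-?[0-9]*[.][0-9]+$') AS decimal_like
FROM my_table;
```

Comparing integer_like and decimal_like against the non-empty row count, together with max_len, gives enough to pick between an integer type, a decimal type, a sized VARCHAR, or falling back to TEXT.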
This is a job for the information_schema.
This query does what you need.
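As a sketch of the kind of information_schema query meant here (my_table is a placeholder table name):

```sql
-- Sketch only: list each column's current declared type and length.
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE
  FROM information_schema.COLUMNS
 WHERE TABLE_SCHEMA = DATABASE()
   AND TABLE_NAME   = 'my_table'
 ORDER BY ORDINAL_POSITION;
```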
Good interactive tools like HeidiSQL and MySQL Workbench can also display the metadata (data describing data) you need in a useful form.
Another possible trick: export a random selection of a few thousand rows to a .csv file and import it into Excel or LibreOffice Calc. Those programs will try to guess the data types.
The interactive tools, or the command-line tool, can do the exporting. The query to get the randomly selected rows into a .csv file is something like
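A sketch of such an export, assuming a table named my_table with an id primary key, a server path /tmp/sample.csv that MySQL is allowed to write to, and the FILE privilege:

```sql
-- Sketch only: dump ~2000 randomly chosen rows to a CSV file on the server.
-- my_table, id, the row count and the output path are all assumptions.
SELECT t.*
  INTO OUTFILE '/tmp/sample.csv'
       FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
       LINES TERMINATED BY '\n'
  FROM my_table AS t
  JOIN (SELECT id FROM my_table ORDER BY RAND() LIMIT 2000) AS r
       USING (id);
```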
This assumes id is a primary key. This will be a bit slow, so don’t do it too often.
This is a hard problem – there are a few options available to you. The most complete solution I’ve found is in Python, and called Tableschema. You feed it data – CSV is easiest – and then ask it to infer the schema.
So, you could use @o.jones’s code to export a random 1000 rows to CSV, then feed that into Tableschema, and use that to create a table with more accurate field types.
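A sketch of the kind of per-column aggregate query the following comments assume (tbl, col1 and col2 are placeholders; extend the pattern to each column you want to inspect):

```sql
-- Sketch only: extremes and maximum lengths for each column in one pass.
SELECT
    MAX(CHAR_LENGTH(col1)) AS col1_max_len,
    MIN(col1)              AS col1_min,
    MAX(col1)              AS col1_max,
    MAX(CHAR_LENGTH(col2)) AS col2_max_len,
    MIN(col2)              AS col2_min,
    MAX(col2)              AS col2_max
FROM tbl;
```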
That will give you some feel for the datatypes and their sizes. (Do substitute the actual column names above.)
- DECIMAL/FLOAT if you see certain punctuation.
- UNSIGNED if neither the max nor min look negative.
- Use the max length to choose between VARCHAR(...) and TEXT.

There is doubtless a REGEXP that would help distinguish numeric from string. Something like
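A sketch (col1 and tbl are placeholders; a result near 1.0 suggests the column holds plain integers):

```sql
-- Sketch only: fraction of non-NULL values that look like integers.
SELECT AVG(col1 REGEXP '^-?[0-9]+$') AS integer_fraction
  FROM tbl;
```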
Variations:
- You could query information_schema.COLUMNS to find the column names and automatically generate the above SELECT, as sketched below. (I would not bother unless you have lots of columns to investigate.)
- The above does not include BINARY/BLOB, character sets, (m,n) for DECIMAL, ENUM.
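A sketch of such a generator query (my_table is a placeholder; group_concat_max_len is raised because ~400 columns would overflow the 1024-byte default, and the generated statement then has to be run separately):

```sql
-- Sketch only: build the big profiling SELECT from information_schema.
SET SESSION group_concat_max_len = 1000000;

SELECT CONCAT(
         'SELECT ',
         GROUP_CONCAT(
           CONCAT('MAX(CHAR_LENGTH(`', COLUMN_NAME, '`)), ',
                  'MIN(`', COLUMN_NAME, '`), ',
                  'MAX(`', COLUMN_NAME, '`)')
           SEPARATOR ', '),
         ' FROM my_table;') AS generated_sql
  FROM information_schema.COLUMNS
 WHERE TABLE_SCHEMA = DATABASE()
   AND TABLE_NAME   = 'my_table';
```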