I have a MySQL table with ~400 columns.
The columns are accumulated from external data, so I don't really have control over the count, initially at least.
In the interest of avoiding errors whilst creating the columns, they are all just the TEXT datatype, and the table now has all of the rows it will likely ever need.
I am now trying to optimise the data. I have an idea of how to do this, but again, I want to avoid errors if possible.
TL;DR: I would like to iterate over each column and determine its appropriate datatype (i.e. TINYINT, INT, VARCHAR(?), CHAR(?), BOOLEAN, TEXT, etc.).
I understand how to loop over the data, but I am more concerned about accuracy and ensuring data integrity, so what I am asking is whether anyone knows of any resources or links to existing code that performs this exact function.
I don't want to miss any nuances or special cases for any of these data types, and it just makes sense not to reinvent the wheel if a resource already exists.
4 Answers
After much browsing, and investigating Tableschema, it seemed that the fastest way to solve the problem would be to iterate through the data myself.
This is only a basic overview, as I can't provide the actual code, but it should be enough to save someone a lot of time in future; also make sure to zero out or null everything before starting the loop for each column.
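This is not the poster's actual code, but as a rough sketch of the kind of per-column aggregate such a loop might compute (my_table and col1 are placeholder names, and the pattern counts only cover non-NULL values):

```sql
-- Illustrative per-column probe only; run once for each column.
-- my_table and col1 are placeholders.
SELECT
    COUNT(*)                               AS total_rows,
    SUM(col1 IS NULL OR col1 = '')         AS empty_rows,
    MAX(CHAR_LENGTH(col1))                 AS max_len,
    SUM(col1 REGEXP '^-?[0-9]+$')          AS integer_like,
    SUM(col1 REGEXP '^-?[0-9]*[.][0-9]+$') AS decimal_like
FROM my_table;
```

Comparing integer_like and decimal_like against the non-empty row count, together with max_len, gives enough to pick between an integer type, a decimal type, a sized VARCHAR, or falling back to TEXT.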
This is a job for the information_schema.
This query does what you need.
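As a sketch of the kind of information_schema query meant here (my_table is a placeholder table name):

```sql
-- Sketch only: list each column's current declared type and length.
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE
  FROM information_schema.COLUMNS
 WHERE TABLE_SCHEMA = DATABASE()
   AND TABLE_NAME   = 'my_table'
 ORDER BY ORDINAL_POSITION;
```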
Good interactive tools like HeidiSQL and MySQL Workbench can also display the metadata (data describing data) you need in a useful form.
Another possible trick: export a random selection of a few thousand rows to a .csv file and import it into Excel or LibreOffice Calc. Those programs will try to guess the data types.
The interactive tools, or the command-line tool, can do the exporting. The query to get the randomly selected rows into a .csv file is something like
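A sketch of such an export, assuming a table named my_table with an id primary key, a server path /tmp/sample.csv that MySQL is allowed to write to, and the FILE privilege:

```sql
-- Sketch only: dump ~2000 randomly chosen rows to a CSV file on the server.
-- my_table, id, the row count and the output path are all assumptions.
SELECT t.*
  INTO OUTFILE '/tmp/sample.csv'
       FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
       LINES TERMINATED BY '\n'
  FROM my_table AS t
  JOIN (SELECT id FROM my_table ORDER BY RAND() LIMIT 2000) AS r
       USING (id);
```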
This assumes id is a primary key. This will be a bit slow, so don’t do it too often.
This is a hard problem – there are a few options available to you. The most complete solution I’ve found is in Python, and called Tableschema. You feed it data – CSV is easiest – and then ask it to infer the schema.
So, you could use @o.jones’s code to export a random 1000 rows to CSV, then feed that into Tableschema, and use that to create a table with more accurate field types.
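A sketch of the kind of per-column aggregate query the following comments assume (tbl, col1 and col2 are placeholders; extend the pattern to each column you want to inspect):

```sql
-- Sketch only: extremes and maximum lengths for each column in one pass.
SELECT
    MAX(CHAR_LENGTH(col1)) AS col1_max_len,
    MIN(col1)              AS col1_min,
    MAX(col1)              AS col1_max,
    MAX(CHAR_LENGTH(col2)) AS col2_max_len,
    MIN(col2)              AS col2_min,
    MAX(col2)              AS col2_max
FROM tbl;
```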
That will give you some feel for the datatypes and their sizes. (Do substitute the actual column names above.)
- DECIMAL/FLOAT if you see certain punctuation.
- UNSIGNED if neither the max nor min look negative.
- Use the max length to choose between VARCHAR(...) and TEXT.

There is doubtless a REGEXP that would help distinguish numeric from string. Something like
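A sketch (col1 and tbl are placeholders; a result near 1.0 suggests the column holds plain integers):

```sql
-- Sketch only: fraction of non-NULL values that look like integers.
SELECT AVG(col1 REGEXP '^-?[0-9]+$') AS integer_fraction
  FROM tbl;
```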
Variations:
- You could query information_schema.COLUMNS to find the column names and automatically generate the above SELECT, as sketched below. (I would not bother unless you have lots of columns to investigate.)
- The above does not include BINARY/BLOB, character sets, (m,n) for DECIMAL, ENUM.
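A sketch of such a generator query (my_table is a placeholder; group_concat_max_len is raised because ~400 columns would overflow the 1024-byte default, and the generated statement then has to be run separately):

```sql
-- Sketch only: build the big profiling SELECT from information_schema.
SET SESSION group_concat_max_len = 1000000;

SELECT CONCAT(
         'SELECT ',
         GROUP_CONCAT(
           CONCAT('MAX(CHAR_LENGTH(`', COLUMN_NAME, '`)), ',
                  'MIN(`', COLUMN_NAME, '`), ',
                  'MAX(`', COLUMN_NAME, '`)')
           SEPARATOR ', '),
         ' FROM my_table;') AS generated_sql
  FROM information_schema.COLUMNS
 WHERE TABLE_SCHEMA = DATABASE()
   AND TABLE_NAME   = 'my_table';
```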