
I have a MySQL table with ~400 columns

The columns are accumulated from external data, so I don't really have control over the count, at least not initially.

To avoid errors while the columns were being created, they are all just the TEXT datatype, and the table now has all of the rows it is ever likely to need.

I am now trying to optimise the data. I have an idea of how to do it, but again I want to avoid errors if possible.

TLDR: I would like to iterate over each column and determine the most appropriate datatype (e.g. TINYINT, INT, VARCHAR(?), CHAR(?), BOOLEAN, TEXT, etc.)

I understand how to loop over the data, but I am more concerned about accuracy and data integrity, so what I am asking is whether anyone knows of any resources or existing code that performs this exact function.

I don't want to miss any nuances or special cases for any of these data types, and it makes sense not to reinvent the wheel if a resource already exists.

4 Answers


  1. Chosen as BEST ANSWER

    After much browsing, and after investigating Tableschema, it seemed that the fastest way to solve the problem would be to iterate through the data myself.

    This is a basic overview, as I can't provide the actual code, but it should be enough to save someone a lot of time in future. Also make sure to reset (zero out or null) all of the tracking variables at the start of each column's loop.

    ini_set('max_execution_time', 0);
    ini_set('memory_limit','1G');
    
    $stmt = $mysqli->query("SHOW COLUMNS FROM `table`");
    $columns = array();
    while($row = $stmt->fetch_assoc()){
        $columns[] = $row['Field']; // the column name is in the `Field` key
    }
    
    
    foreach($columns as $column){
        $int = true;
        $decimal = true;
        $boolean = true;
        $intmax = 0;
        $intneg = false;
        $integer_part = 0;
        $decimal_part = 0;
        $decimal_neg = false;
        $consistent_len = 0;
        $consistent_strlen = null;
        $nullable = false;
        $boolean_counter = 0;
        $boolean_array = array();
        $strlen = 0;
    
        $stmt = $mysqli->query("SELECT `$column` FROM `table`");
        $rowdata = array();
        while($row = $stmt->fetch_row()){
            $rowdata[] = $row[0];
        }
        foreach($rowdata as $line){
            // note: empty() also treats "0" as empty, which would skew INT/BOOLEAN detection
            if($line === null || trim($line) === ''){
                $nullable = true;
                continue;
            }
            else{
                $data = trim($line);
            }
            // turn off $int, $decimal, $boolean to false based on criteria
            // mb_strlen($data, 'UTF-8'); helpful for varchar multibyte
        }
        // Check order of preference > BOOLEAN > INT (Variants) > DECIMAL > CHAR > VARCHAR > TEXT
    
        // define $type, $unsigned, $is_null
    
        // $unsigned and $is_null need leading spaces, e.g. " UNSIGNED" and " NOT NULL"
        $column_type = $type.$unsigned.$is_null;
        $alter_query = "ALTER TABLE `table` MODIFY COLUMN `$column` $column_type";
        // review $alter_query before running it with $mysqli->query($alter_query)
    }
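
    For anyone wanting a more concrete starting point, here is one way the elided checks and the final decision could be filled in. This is only a sketch of the same approach, not the actual code: the helper names (check_value, pick_type), the regexes, the numeric ranges and the assumption that BOOLEAN means a column holding nothing but 0/1 are illustrative, so adjust them to your own data before trusting the result.

    // Sketch of the per-value checks elided above ("turn off $int, $decimal, $boolean").
    // Assumes $data is the trimmed, non-empty string for the current row.
    function check_value($data, &$int, &$decimal, &$boolean, &$intneg, &$intmax, &$strlen){
        if(preg_match('/^[-+]?[0-9]+$/', $data)){
            if($data[0] === '-'){ $intneg = true; }
            $intmax = max($intmax, abs((int)$data));
        } else {
            $int = false;
        }
        if(!preg_match('/^[-+]?[0-9]+(\.[0-9]+)?$/', $data)){
            $decimal = false;
        }
        if($data !== '0' && $data !== '1'){
            $boolean = false; // assumption: BOOLEAN only if the column never holds anything but 0/1
        }
        $strlen = max($strlen, mb_strlen($data, 'UTF-8')); // longest value seen, multibyte-safe
    }

    // Sketch of the order-of-preference decision: BOOLEAN > INT variants > DECIMAL > CHAR > VARCHAR > TEXT.
    function pick_type($int, $decimal, $boolean, $intneg, $intmax, $strlen, $all_same_length){
        if($boolean){ return 'TINYINT(1)'; }                  // MySQL's BOOLEAN synonym
        if($int){
            if($intmax <= ($intneg ? 127 : 255)){ return $intneg ? 'TINYINT' : 'TINYINT UNSIGNED'; }
            if($intmax <= 2147483647){            return $intneg ? 'INT'     : 'INT UNSIGNED'; }
            return $intneg ? 'BIGINT' : 'BIGINT UNSIGNED';
        }
        if($decimal){ return 'DECIMAL(20,6)'; }               // derive (m,n) from observed widths instead
        if($strlen == 0){ return 'TEXT'; }                    // the column was entirely empty
        if($strlen <= 255){ return $all_same_length ? "CHAR($strlen)" : "VARCHAR($strlen)"; }
        return 'TEXT';
    }

    check_value() slots into the row loop in place of the "turn off" comment, and pick_type() replaces the order-of-preference comment ($all_same_length being whatever you derive from the $consistent_strlen tracking), after which you append the NULL/NOT NULL part and build the ALTER as above.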
    
    

  2. This is a job for the information_schema.

    This query does what you need.

    SELECT TABLE_NAME, COLUMN_NAME, COLUMN_TYPE, COLUMN_DEFAULT
      FROM information_schema.COLUMNS
     WHERE TABLE_NAME = 'whatever_table_name'
       AND TABLE_SCHEMA = DATABASE() 
     ORDER BY ORDINAL_POSITION;
    

    Good interactive tools like HeidiSQL and MySQL Workbench can also display the metadata (data describing data) you need in a useful form.

    Another possible trick: export a random selection of a few thousand rows to a .csv file and import it into Excel or LibreOffice Calc. Those programs will try to guess the data types.

    The interactive tools, or the command-line tool, can do the exporting. The query to get the randomly selected rows into a .csv file is something like

    SELECT *
      FROM `table`
     WHERE id IN (SELECT id FROM (SELECT id FROM `table` ORDER BY RAND() LIMIT 1000) AS sample)
           -- the extra derived table is needed because MySQL does not allow LIMIT directly inside IN (...)
      INTO OUTFILE 'sample.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
     LINES TERMINATED BY '\n';
    

    This assumes id is a primary key.

    This will be a bit slow, so don’t do it too often.

  3. This is a hard problem; there are a few options available to you. The most complete solution I've found is in Python, and it is called Tableschema. You feed it data (CSV is easiest) and then ask it to infer the schema.

    So, you could use @o.jones’s code to export a random 1000 rows to CSV, then feed that into Tableschema, and use that to create a table with more accurate field types.

  4. SELECT 
        MIN(col1), MAX(col1), MAX(CHAR_LENGTH(col1)),
        MIN(col2), MAX(col2), MAX(CHAR_LENGTH(col2)),
        ...
       FROM tbl;
    

    That will give you some feel for the datatypes and their sizes. (Do substitute the actual column names above.)

    • If both max and min look numeric and max len is small, then probably some numeric type.
      • Consider DECIMAL/FLOAT if you see certain punctuation.
      • Consider UNSIGNED if neither the max nor min look negative.
    • If max or min are non-numeric and the max len is under 100, use twice the max len in VARCHAR(...).
    • If max or min are non-numeric with a bigger max len, use some variation of TEXT.

    There is doubtless a REGEXP that would help distinguish numeric from string. Something like

    col1 REGEXP '^[0-9]+$'      -- some size of INT UNSIGNED
    col1 REGEXP '^[-+0-9]+$'    -- some size of INT SIGNED
    col1 REGEXP '^[-+0-9.]+$'   -- DECIMAL(..., ...)
    col1 REGEXP '^[-+0-9.e]+$'  -- FLOAT or DOUBLE
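
    One way to put those patterns to work is to count, per column, how many non-empty values each one matches. This is only a sketch, assuming a mysqli connection in $mysqli as in the accepted answer, with col1 and tbl as placeholders:

    // Count how many non-empty values of col1 match each candidate pattern.
    // If a count equals non_empty, every populated value fits that type.
    $sql = "SELECT COUNT(NULLIF(col1, ''))        AS non_empty,
                   SUM(col1 REGEXP '^[0-9]+$')    AS int_unsigned_like,
                   SUM(col1 REGEXP '^[-+0-9]+$')  AS int_signed_like,
                   SUM(col1 REGEXP '^[-+0-9.]+$') AS decimal_like,
                   MAX(CHAR_LENGTH(col1))         AS max_len
              FROM tbl";
    $profile = $mysqli->query($sql)->fetch_assoc();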
    

    Variations:

    TINYINT  SMALLINT  MEDIUMINT  INT  BIGINT
    TEXT  MEDIUMTEXT  LONGTEXT  (don't use TINYTEXT)
    

    You could query information_schema.COLUMNS to find the column names and automatically generate the above SELECT. (I would not bother unless you have lots of columns to investigate.)
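
    With a table this wide, that generation step is worth automating. Here is a rough sketch (again assuming a mysqli connection in $mysqli; 'whatever_table_name' is a placeholder) that builds the profiling SELECT from information_schema.COLUMNS:

    // Build the wide MIN/MAX/CHAR_LENGTH profiling query from information_schema.
    $table = 'whatever_table_name'; // placeholder -- substitute the real table name
    $res = $mysqli->query(
        "SELECT COLUMN_NAME
           FROM information_schema.COLUMNS
          WHERE TABLE_SCHEMA = DATABASE()
            AND TABLE_NAME = '$table'
          ORDER BY ORDINAL_POSITION");

    $parts = array();
    while($row = $res->fetch_assoc()){
        $c = '`' . $row['COLUMN_NAME'] . '`';
        $parts[] = "MIN($c), MAX($c), MAX(CHAR_LENGTH($c))";
    }
    $profiling_query = "SELECT\n  " . implode(",\n  ", $parts) . "\nFROM `$table`";
    // run $profiling_query and read the bullet points above against each column's three values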

    The above does not include BINARY/BLOB, character sets, (m,n) for DECIMAL, ENUM.
