skip to Main Content

I have a MYSQL InnoDB table table with the following columns (table and column names changed):

  • date (PK, DATE)
  • var_a (PK, FK, INT)
  • var_b (PK, FK, INT)
  • rel_ab (DECIMAL)

where rel_ab is a column that describes a relationship between 2 variables var_a and var_b at a given date. (var_a and var_b reference different tables)

The data is uploaded in daily batches, totaling around 7 million rows per day. The problem is, after only a few weeks, it is starting to take many hours to upload each new daily batch. Clearly we need to improve our table design. Here are some additional details of our table.

  • We use COMPRESSION="zlib".
  • In addition to our composite primary key, we also have indexes on columns var_a and var_b, necessitated by the foreign keys.
  • When we pull data from this table, it is always with the query SELECT * FROM table WHERE date = <date> for a given date <date>. Selecting only takes a couple minutes.
  • We will (with near certainty) never have a reason to delete entries from the tables that var_a and var_b refer to.
  • Data is uploaded by the pandas function df.to_sql('temp', con, if_exists='replace', index=False, method='multi'), where we insert ignore from temp to table and then drop temp.

As a result, I was planning on doing at least one of the following:

  • Removing the foreign key constraints on columns var_a and var_b and relying on the data uploading process to do everything correctly. This is because in our use-case, neither index actually speeds up the query speeds.
  • Dividing the table to have different tables for each date. For example, I would have a table called table_230501 with columns var_a, var_b, rel_ab. This is because we only ever select from one date at a time.

I know that the first solution could threaten data integrity, while the second solution would clutter our schema. In my limited experience, I’ve also never heard of the second option being done and couldn’t find any example of this design online. Are either of these options a wise solution? Both will improve upload speed and decrease disk usage but both have their cons as well. Otherwise, what are other ways that I can improve upload speed?

Edit: My SHOW CREATE TABLE should look like

CREATE TABLE table (
  date date NOT NULL,
  var_a int NOT NULL,
  var_b int NOT NULL,
  rel_ab decimal(19,16) NOT NULL,
  PRIMARY KEY (date,`var_a`,`var_b`),
  KEY a_idx (var_a),
  KEY b_idx (var_b),
  CONSTRAINT a FOREIGN KEY (var_a) REFERENCES other_table_a (var_a) ON DELETE RESTRICT ON UPDATE CASCADE,
  CONSTRAINT b FOREIGN KEY (var_b) REFERENCES other_table_b (var_b) ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMPRESSION="zlib"

2

Answers


  1. There are a few potential solutions that could help you improve the upload speed of your MySQL table:

    Remove indexes on var_a and var_b: Since you are not using these indexes to speed up queries, removing them could help speed up the upload process. However, if you are using foreign key constraints, it is generally recommended to keep indexes on the columns that are part of the foreign keys.

    Partition the table by date: Partitioning can help with query performance, as it allows the database to only scan the relevant partitions for a given query. However, it can also make maintenance and backups more complex, and it may not be necessary if your queries are already performing well.

    Use a bulk insert method: Rather than using df.to_sql to insert individual rows, you could try using a bulk insert method like LOAD DATA INFILE or the MySQL bulk insert API. This can be faster than individual inserts, especially if you can upload data in batches rather than one row at a time.

    Use a different compression algorithm: You are currently using zlib compression, but there are other compression algorithms that may be faster or more efficient for your data. You could try experimenting with different compression options to see if they improve upload speed.

    Increase server resources: If you have the budget and resources, upgrading your server hardware or increasing the number of servers could help with upload speed. This may not be a feasible option for everyone, but it is worth considering if you have exhausted other options.

    In terms of the options you suggested, removing foreign key constraints could potentially lead to data integrity issues, so I would not recommend that approach. Partitioning by date could be a good solution if you are already experiencing performance issues with your queries, but it may not be necessary if your queries are already running quickly.

    Login or Signup to reply.
  2. To speed up the upload, get rid of it. Seriously, why put the data in a table if the only thing you do is fetch exactly what was in one date’s file? (Your Comment points out that the single file is really a few files. Combining them first is probably a good idea.)

    If you do need the data in a table, let’s discuss these…

    • We must see all the main queries before settling on the indexes.
    • The order of the columns in the PK is important both for loading and for querying.
    • Partitioning may help with loading, but is unlikely to help with queries. Exception: Will you be dropping ‘old’ data?
    • Please provide SHOW CREATE TABLE; what you provided may have left out some subtle items.
    • How is the loading done? One giant LOAD DATA? Hopefully not one row INSERTed at a time. I don’t know how Pandas works. (Nor how the 99 other packages that ‘simplify’ MySQL access work.) Please find out what it does under the covers. You may have to bypass Pandas to get better performance. Bulk loading is at least 10 times as fast as row-by-row.
    • I don’t yet see a need for a temp table when loading. Well, maybe. If you get rid of the FKs (as you suggested), you could do queries to validate the existence of var_a and var_b in the other tables. That is ‘simulate FK’.
    • If practical, sort the incoming data according to the PK. (This may be the source of the growing sluggishness.)
    • Are there any secondary keys? They will impact load speed.
    • I think your FKs imply indexes into the other tables.
    • Are you adding new rows to the other tables?
    • "rel_ab (DECIMAL)" — How many decimal places? What is the exact declaration? If it is some measurement, did you consider FLOAT?
    • Now many rows in the other tables. That is, do you really need a 4-byte INT to reference them. Switching to a 3-byte MEDIUMINT [UNSIGNED] would save at least 7MB per day.
    • What do you do with the 7M rows from that SELECT?
    • No Compression. It is inefficient in InnoDB. Only one of the 4 columns is likely to be compressible. Compression requires extra buffer_pool_space. Compression takes a lot of CPU. 2x shrinkage is typical for InnoDB.

    Multiple ‘identical’ tables is always unwise. One table is always better. But, as suggested above, zero tables is still better.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search