I have a MySQL InnoDB table (table and column names changed) with the following columns:
- `date` (PK, DATE)
- `var_a` (PK, FK, INT)
- `var_b` (PK, FK, INT)
- `rel_ab` (DECIMAL)

where `rel_ab` describes a relationship between the two variables `var_a` and `var_b` at a given date. (`var_a` and `var_b` reference different tables.)
The data is uploaded in daily batches, totaling around 7 million rows per day. The problem is, after only a few weeks, it is starting to take many hours to upload each new daily batch. Clearly we need to improve our table design. Here are some additional details of our table.
- We use `COMPRESSION="zlib"`.
- In addition to our composite primary key, we also have indexes on columns `var_a` and `var_b`, necessitated by the foreign keys.
- When we pull data from this table, it is always with the query `SELECT * FROM table WHERE date = <date>` for a given date `<date>`. Selecting takes only a couple of minutes.
- We will (with near certainty) never have a reason to delete entries from the tables that `var_a` and `var_b` refer to.
- Data is uploaded by the pandas call `df.to_sql('temp', con, if_exists='replace', index=False, method='multi')`, after which we `INSERT IGNORE` from `temp` into `table` and then drop `temp` (sketched below).
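For concreteness, the merge step after `df.to_sql` fills `temp` is roughly the following (a sketch of our process, not the exact statements):

```sql
-- Merge the staged daily batch into the main table, then discard the stage.
INSERT IGNORE INTO `table` (`date`, `var_a`, `var_b`, `rel_ab`)
SELECT `date`, `var_a`, `var_b`, `rel_ab`
FROM `temp`;
DROP TABLE `temp`;
```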
As a result, I was planning on doing at least one of the following:
- Removing the foreign key constraints on columns `var_a` and `var_b` and relying on the data-uploading process to do everything correctly. This is because in our use case, neither index actually speeds up our queries.
- Dividing the table into different tables for each date, for example a table called `table_230501` with columns `var_a`, `var_b`, and `rel_ab` (sketched below). This is because we only ever select from one date at a time.
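For concreteness, the second option would mean creating a table like the following each day (a sketch; `table_230501` would hold the batch for 2023-05-01):

```sql
-- Hypothetical per-date table: the date moves from a column into the table name.
CREATE TABLE `table_230501` (
  `var_a` int NOT NULL,
  `var_b` int NOT NULL,
  `rel_ab` decimal(19,16) NOT NULL,
  PRIMARY KEY (`var_a`, `var_b`)
) ENGINE=InnoDB COMPRESSION="zlib";
```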
I know that the first solution could threaten data integrity, while the second would clutter our schema. In my limited experience, I've also never heard of the second option being done and couldn't find any example of this design online. Is either of these options a wise solution? Both would improve upload speed and decrease disk usage, but both have their downsides as well. If not, what other ways are there to improve upload speed?
Edit: My `SHOW CREATE TABLE` output looks roughly like this:
```sql
CREATE TABLE `table` (
  `date` date NOT NULL,
  `var_a` int NOT NULL,
  `var_b` int NOT NULL,
  `rel_ab` decimal(19,16) NOT NULL,
  PRIMARY KEY (`date`, `var_a`, `var_b`),
  KEY `a_idx` (`var_a`),
  KEY `b_idx` (`var_b`),
  CONSTRAINT `a` FOREIGN KEY (`var_a`) REFERENCES `other_table_a` (`var_a`) ON DELETE RESTRICT ON UPDATE CASCADE,
  CONSTRAINT `b` FOREIGN KEY (`var_b`) REFERENCES `other_table_b` (`var_b`) ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMPRESSION="zlib"
```
2 Answers
There are a few potential solutions that could help you improve the upload speed of your MySQL table:
Remove the indexes on `var_a` and `var_b`: Since you are not using these indexes to speed up queries, removing them could speed up the upload. Note, however, that InnoDB requires an index on every foreign key column, so these indexes can only go if the foreign key constraints go first.
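If you go that route, the order matters (a sketch using the constraint and index names from the `SHOW CREATE TABLE` in the question):

```sql
-- Drop the foreign keys first: InnoDB will not let you drop an index that a
-- foreign key constraint still depends on.
ALTER TABLE `table` DROP FOREIGN KEY `a`, DROP FOREIGN KEY `b`;
ALTER TABLE `table` DROP INDEX `a_idx`, DROP INDEX `b_idx`;
```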
Partition the table by date: Partitioning can help with query performance, as it allows the database to only scan the relevant partitions for a given query. However, it can also make maintenance and backups more complex, and it may not be necessary if your queries are already performing well.
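If you try this, note that MySQL does not support foreign keys on partitioned tables, so the constraints would have to be dropped first. A sketch of daily range partitioning (the partition names are illustrative):

```sql
-- One partition per day; the load job would add a new partition for each batch.
ALTER TABLE `table`
PARTITION BY RANGE (TO_DAYS(`date`)) (
  PARTITION p20230501 VALUES LESS THAN (TO_DAYS('2023-05-02')),
  PARTITION p20230502 VALUES LESS THAN (TO_DAYS('2023-05-03')),
  PARTITION pfuture   VALUES LESS THAN MAXVALUE
);
```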
Use a bulk insert method: Rather than using `df.to_sql` to insert individual rows, you could try using a bulk insert method like `LOAD DATA INFILE` or the MySQL bulk insert API. This can be faster than individual inserts, especially if you can upload data in batches rather than one row at a time.
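For example, if the daily batch is first written out as a CSV file (the file path here is hypothetical), the staging table could be filled in a single statement instead of through pandas:

```sql
-- Bulk-load one daily batch into the staging table.
LOAD DATA LOCAL INFILE '/data/batch_2023-05-01.csv'
INTO TABLE `temp`
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES  -- assumes the file has a header row
(`date`, `var_a`, `var_b`, `rel_ab`);
```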
Use a different compression algorithm: You are currently using zlib compression, but there are other compression algorithms that may be faster or more efficient for your data. You could try experimenting with different compression options to see if they improve upload speed.
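For instance, InnoDB page compression also accepts `lz4`, which typically compresses somewhat less but faster than zlib, and you can disable compression entirely to measure how much it contributes to load time. A sketch:

```sql
-- Switch the table to lz4 page compression (or COMPRESSION="none" to test).
ALTER TABLE `table` COMPRESSION="lz4";
-- The new setting only applies to pages written from now on; rebuild the
-- table to apply it to existing data.
OPTIMIZE TABLE `table`;
```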
Increase server resources: If you have the budget and resources, upgrading your server hardware or increasing the number of servers could help with upload speed. This may not be a feasible option for everyone, but it is worth considering if you have exhausted other options.
In terms of the options you suggested, removing foreign key constraints could potentially lead to data integrity issues, so I would not recommend that approach. Partitioning by date could be a good solution if you are already experiencing performance issues with your queries, but it may not be necessary if your queries are already running quickly.
To speed up the upload, get rid of it. Seriously, why put the data in a table if the only thing you do is fetch exactly what was in one date's file? (Your comment points out that the single file is really a few files; combining them first is probably a good idea.)
If you do need the data in a table, let’s discuss these…
- `SHOW CREATE TABLE`; what you provided may have left out some subtle items.
- `LOAD DATA`? Hopefully not one row `INSERT`ed at a time. I don't know how Pandas works. (Nor how the 99 other packages that 'simplify' MySQL access work.) Please find out what it does under the covers. You may have to bypass Pandas to get better performance. Bulk loading is at least 10 times as fast as row-by-row.
- `FLOAT`? That is, would a 4-byte `FLOAT` give enough precision for `rel_ab` instead of the much larger `DECIMAL(19,16)`?
- The tables that `var_a` and `var_b` reference: do you really need a 4-byte `INT` to reference them? Switching to a 3-byte `MEDIUMINT [UNSIGNED]` would save at least 7MB per day. (Both changes are sketched after this list.)
- Multiple 'identical' tables is always unwise. One table is always better. But, as suggested above, zero tables is still better.
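A sketch of both size reductions, assuming `rel_ab` tolerates `FLOAT`'s roughly seven significant digits and the id values fit in the smaller integer type (the referenced columns in `other_table_a` and `other_table_b` would have to be changed to match, or the foreign keys dropped first):

```sql
-- Shrink the row: FLOAT is 4 bytes vs. DECIMAL(19,16); MEDIUMINT UNSIGNED
-- is 3 bytes and holds values up to 16,777,215.
ALTER TABLE `table`
  MODIFY `rel_ab` float NOT NULL,
  MODIFY `var_a` mediumint unsigned NOT NULL,
  MODIFY `var_b` mediumint unsigned NOT NULL;
```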