
I have a flat file with 3,200,000 rows, and the application's performance decreases as time progresses.
For example, the first 100,000 rows take about 30 seconds, but it takes a total of about 4 hours to reach 320,000 rows.
I am using JdbcTemplate for batch saving. I have also tried asynchronous reads with CompletableFuture.
Database used: MySQL, with 2 indexes – 1 primary key, 1 unique index.

I expected a lower total load time.

2 Answers


  1. "Replacing" a table: If you are really "replacing" the entire table, this is much better.

    For example, do you receive a full replacement for your table every night? If so, there is a much less invasive way to handle it. (Not faster, but I suspect speed is not the real issue.) The following keeps the real table "online" except for a very brief, atomic RENAME:

        CREATE TABLE `new` LIKE `real`;
        -- load up `new` by whatever means
        RENAME TABLE `real` TO old,
                     `new` TO `real`;
        DROP TABLE old;
    

    LOAD DATA INFILE (applies to either adding or replacing)

    If the "flat file" is comma- or tab-separated, then see if LOAD DATA can be used. It is very fast.

    Batch loading

    • I recommend 100-1000 rows per batch. (100K rows in one shot may be so big that it leads to other slowdowns; this is especially bad if replication is involved.)
    • Commit after each batch, or use autocommit=ON. (Keeping 3.2M rows' worth of changes in the undo log is costly.) A sketch of one such batch follows this list.
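
    For illustration, a minimal sketch of one such batch in plain SQL, using the same hypothetical (id, email, payload) columns; in the application this corresponds to flushing the JDBC batch and committing every ~1000 rows instead of holding the whole file in one transaction.

        -- one batch of ~1000 rows, committed before the next batch starts
        START TRANSACTION;
        INSERT INTO `real` (id, email, payload) VALUES
            (1, 'a@example.com', '...'),
            (2, 'b@example.com', '...'),
            -- ... roughly 1000 rows per INSERT ...
            (1000, 'z@example.com', '...');
        COMMIT;
        -- repeat with the next batch of rows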

    Other notes

    Rethink whether you really need two unique indexes (the PK counts as one); maintaining a second unique index roughly doubles the uniqueness-checking work for each inserted row.
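
    To see what each INSERT actually has to maintain, list the indexes; if the secondary unique index turns out to be unnecessary (or can be dropped just for the duration of the load), removing it saves one uniqueness check per row. The index and column names below are hypothetical.

        SHOW INDEX FROM `real`;

        -- only if the second unique index is genuinely not needed during the load:
        ALTER TABLE `real` DROP INDEX uniq_email;
        -- ... bulk load ...
        -- note: re-adding it will fail if duplicates slipped in during the load
        ALTER TABLE `real` ADD UNIQUE INDEX uniq_email (email);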

  2. (This partially addresses the "slow down" part of the question.)

    The setting for innodb_buffer_pool_size should be about 70% of RAM. (However, if RAM is less than 4GB or you have lots of RAM taken up by other apps, that percentage is too high.)

    If the buffer_pool grows bigger than the available RAM, then "swapping" comes into play. This is terrible for performance.
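
    A quick way to check the current value and, on MySQL 5.7+, resize it without a restart (the 5GB figure is only an example for a server with about 8GB of RAM mostly dedicated to MySQL; make the change permanent in my.cnf as well):

        SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;

        -- dynamic resize, MySQL 5.7+ (the value is rounded up to a multiple of the chunk size)
        SET GLOBAL innodb_buffer_pool_size = 5368709120;   -- ~5GB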

    How big is the table? If it has lots of big text columns, then it could be quite big. Big implies that it might overflow the buffer_pool and "slow down" as more rows are added.
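
    One way to check the data-plus-index size against information_schema (replace the schema and table names with your own):

        SELECT table_name,
               ROUND((data_length + index_length) / 1024 / 1024) AS total_mb
        FROM information_schema.tables
        WHERE table_schema = 'your_db'
          AND table_name   = 'real';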

    Indexing with UUIDs also can cause a slowdown due to overflowing the buffer_pool. This is because of the randomness of such indexes. More: UUIDs

    Do you really need both unique indexes? Here’s the issue. When inserting a row, all unique indexes are checked before moving on to the next row.

    • AUTO_INCREMENT and not specified — appends to ‘end’ of table
    • UUID — Jumps around ‘randomly’.
    • DATETIME / TIMESTAMP — If the incoming data is roughly ordered by datetime, then an index on such a column will be updated efficiently.
    • Other index types — perhaps somewhere in between auto_inc and uuid (see the sketch after this list).
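
    A sketch of the two extremes, with hypothetical tables: the first keeps inserts appending to the end of the clustered index, while the second scatters them across random pages that all need to fit in the buffer_pool.

        -- inserts append to the "end": only the last page of the PK is hot
        CREATE TABLE t_auto (
            id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            payload VARCHAR(255)
        );

        -- inserts land at "random" positions: many pages must be cached
        CREATE TABLE t_uuid (
            id      CHAR(36) NOT NULL PRIMARY KEY,   -- e.g. populated with UUID()
            payload VARCHAR(255)
        );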