
I have a huge table to load into the cloud (BigQuery).
The table has a very simple structure:

CREATE TABLE `my_huge_table` (
                                 `user_id` int(10) NOT NULL,
                                 `value_id` int(11) NOT NULL,
                                 `some_value` tinyint(2) unsigned NOT NULL DEFAULT '1',
                                 `created_at` datetime DEFAULT NULL,
                                 `updated_at` datetime DEFAULT NULL,
                                 UNIQUE KEY `user_segment_UNIQUE` (`user_id`,`value_id`),
                                 KEY `user_idx` (`user_id`),
                                 KEY `value_idx` (`value_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

The table has about 850 000 000 (850M) rows.

And here is what information_schema reports (values formatted for readability):

select TABLE_ROWS, DATA_LENGTH, INDEX_LENGTH
from information_schema.TABLES where TABLE_NAME = "my_huge_table"
"TABLE_ROWS"    "849 536 609" 
"DATA_LENGTH"   "59 544 354 816"
"INDEX_LENGTH"  "38 034 964 480"

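So that is roughly 59 544 354 816 / 849 536 609 ≈ 70 bytes of data per row, plus about 45 bytes of index per row.
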
BigQuery allows loading chunks of less than 4 GB each, so I work out the row count for every chunk and end up with a batch of queries using LIMIT and OFFSET. For example:

....
SELECT * INTO OUTFILE 'my_huge_table_720000000_10000000.bulk' FROM `my_huge_table` t1 limit 10000000 offset 720000000;
SELECT * INTO OUTFILE 'my_huge_table_730000000_10000000.bulk' FROM `my_huge_table` t1 limit 10000000 offset 730000000;
...

And it is so slow! And that is not all:
the query time (in seconds) grows with every chunk:

Offset (M rows)   Seconds
10  32
20  35
30  38
40  40
50  44
60  45
70  49
80  51
90  53
100 57
110 59
120 63
130 65
140 66
150 71
160 73
170 75
180 77
190 80
200 83
210 87
220 88
230 90
240 94
250 94
260 99
270 100
280 109
290 118
300 107
310 113
320 115
330 118
340 121
350 124
360 127
370 130
380 130
390 135
400 139
410 140
420 144
430 144
440 146
450 148
460 149
470 151
480 155
490 160
500 164
510 164
520 165
530 170
540 169
550 173
560 178
570 183
580 185
590 186
600 206
610 380
620 388
630 399
640 411
650 426
660 453
670 491
680 504
690 779
700 795
710 901
720 922
730 956
740 936
750 972
760 1010
770 1027
780 1044
790 1047
800 980
810 998

Question: How do I speed up all of these queries?

2 Answers


  1. You have discovered an inconvenient truth. LIMIT / OFFSET is ridiculously inefficient for large offsets.

    Are you running on Linux or another UNIX-like OS? Do you have enough drive space to hold twice the whole text file, at least temporarily?

    If so, you could use split to break the dump into pieces. Something like this will produce chunks of roughly 2 GiB each:

    split --line-bytes=2G --suffix-length=4 my_huge_table.bulk
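
    This assumes the whole table has already been dumped to one big text file in a single pass; a minimal sketch of that step (the output path is just an example and must be writable by the MySQL server, e.g. under its secure_file_priv directory):

    -- One unchunked dump; split then cuts the resulting file into smaller pieces.
    SELECT * INTO OUTFILE '/var/lib/mysql-files/my_huge_table.bulk'
    FROM `my_huge_table`;

    With --suffix-length=4 and the default prefix, split writes the pieces as xaaaa, xaaab, …, each of which can then be loaded as a separate chunk.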
    
  2. As has already been pointed out, OFFSET is very inefficient. It is, however, possible to iterate using a compound key; see
    https://mysql.rjweb.org/doc.php/deletebig#iterating_through_a_compound_key

    As for chunk size, I doubt if there is much difference between 1000 rows at a time and 4GB at a time.

    Alternatively, if no user_id has too many rows, consider doing one user at a time. Again, "remember where you left off"; do not use OFFSET:

    SELECT user_id FROM t
        WHERE user_id > $left_off
        ORDER BY user_id LIMIT 1;
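
    Applied to the export itself, the compound-key iteration from the link might look roughly like the sketch below; the $last_user_id / $last_value_id placeholders and the output filename are only illustrative, being whatever your driver script saved from the final row of the previous chunk:

    -- Sketch only: resume just past the last (user_id, value_id) already exported.
    -- For the first chunk, seed the placeholders below the smallest real key.
    SELECT * INTO OUTFILE 'my_huge_table_chunk_0001.bulk'
    FROM `my_huge_table`
    WHERE user_id >= $last_user_id
      AND ( user_id > $last_user_id OR value_id > $last_value_id )
    ORDER BY user_id, value_id
    LIMIT 10000000;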
    

    PS: There is no need to have KEY(user_id), since UNIQUE(user_id, …) already provides such an index.
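
    If you decide to drop it, that would be something along the lines of:

    ALTER TABLE `my_huge_table` DROP INDEX `user_idx`;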
