Problem Statement: We receive a request containing an 11-digit number and need to efficiently look it up in the database and return a single row (the last updated one) based on the range in which the number falls.
I have already used MySQL and did load testing with proper indexes, but the lookup is still taking too long.
Current DB Structure:
Using MySQL
Currently, we have a table with two columns, low_range and high_range, which store the range boundaries, and two further columns that store the corresponding data: is_active (values 0 or 1) and code (an int value that is the id of another table, code_mapping).
Table 1 name: range_mapping
DB schema:
create table range_mapping (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`low_range` decimal(11,0) NOT NULL,
`high_range` decimal(11,0) NOT NULL,
`is_active` tinyint(1) NOT NULL DEFAULT 1,
`code` int(8) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_comp_is_active_low_high_range` (`is_active`, `low_range`, `high_range`)
) ENGINE=InnoDB AUTO_INCREMENT=26891234 DEFAULT CHARSET=utf8
Table 2 name: code_mapping
DB schema:
create table code_mapping (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(100) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `nameIdx` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=4410 DEFAULT CHARSET=utf8
Query to be optimized: I need to redesign or optimize the following query for faster execution.
select a.low_range, a.high_range, b.name from range_mapping AS a
LEFT JOIN code_mapping AS b ON a.code = b.id
WHERE a.is_active = 1 and 12345678912 BETWEEN a.low_range AND a.high_range
ORDER BY a.id DESC
limit 1;
Explain query:
explain select a.low_range, a.high_range, b.name
from range_mapping AS a force index(idx_comp_is_active_low_high_range)
LEFT JOIN code_mapping AS b ON a.code = b.id
WHERE a.is_active = 1 and 12345678912 BETWEEN a.low_range AND a.high_range
ORDER BY a.id DESC limit 1;
Output:
id: 1
select_type: SIMPLE
table: a
type: range
possible_keys: idx_comp_is_active_low_high_range
key: idx_comp_is_active_low_high_range
key_len: 11
ref: NULL
rows: 227190
Extra: Using index condition; Using filesort
***************************
id: 1
select_type: SIMPLE
table: b
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: testbackup.a.code
rows: 1
Extra: Using where
Explain Analyze of query:
-> Limit: 1 row(s)  (cost=65515 rows=0.104) (actual time=529..529 rows=1 loops=1)
    -> Nested loop left join  (cost=65515 rows=0.104) (actual time=529..529 rows=1 loops=1)
        -> Filter: ((a.is_active = 1) and (535387481 between a.low_range and a.high_range))  (cost=0.21 rows=0.104) (actual time=529..529 rows=1 loops=1)
            -> Index scan on a using PRIMARY (reverse)  (cost=0.21 rows=2) (actual time=1.25..490 rows=531614 loops=1)
        -> Filter: (a.code = b.id)  (cost=0.961 rows=1) (actual time=0.0347..0.0347 rows=1 loops=1)
            -> Single-row index lookup on b using PRIMARY (id=a.code)  (cost=0.961 rows=1) (actual time=0.0331..0.0331 rows=1 loops=1)
Request: 12345678912
Possible rows in table:
low_range: 12345678901 high_range: 12345678913
low_range: 12345678910 high_range: 12345678912
low_range: 12345678902 high_range: 12345678920
Question:
Which database can be used to store millions of range records with efficient lookups? Is there a better approach with MySQL, if possible?
The expectation is that the lookup should be fast and efficient, roughly 500 ms.
2 Answers
PostgreSQL has built-in range types, which offer a @> containment operator supported by GiST and SP-GiST indexes. Thing is, the performance of index lookups is driven by the distribution of your samples. If I purposefully move all filler samples out of the target range, I can pick the targets out of 50M rows in 0.06 ms: demo1 (these demos are scaled-down examples of tests I did). It's the same if I generate them fully randomly but tighten the range widths down to 100 (the range widths in your example are 12, 2, 18): at a scale of 50M, with numbers this high, overlaps are unlikely, selectivity is high, and the index scan is rapid. If I bump the range width limit up to 100k, generating more hits in the target range, it still takes just 0.16 ms, and 4.6 ms if I go further up to 10M with even more matches. demo2
If I spawn 50M completely random ranges, most of them span so wide that they are unlikely not to overlap and catch the target, and the lookup jumps to 7 s because it goes through nearly half the heap; the index isn't really helpful in a case like that.
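A minimal sketch of that setup, assuming a hypothetical table named range_mapping_pg (the column, index name, and code value 42 are illustrative, not taken from the demos above):

create table range_mapping_pg (
  id        bigserial PRIMARY KEY,
  num_range int8range NOT NULL,   -- low_range..high_range stored as a single range value
  is_active boolean   NOT NULL DEFAULT true,
  code      integer
);

-- GiST index on the range column; the @> containment operator can use it
create index idx_range_mapping_pg_range on range_mapping_pg using gist (num_range);

-- '[]' makes both bounds inclusive; PostgreSQL canonicalizes discrete ranges
-- to '[)' internally, so upper(num_range) returns high_range + 1
insert into range_mapping_pg (num_range, is_active, code)
values (int8range(12345678901, 12345678913, '[]'), true, 42);

-- Lookup: latest active row whose range contains the requested number
select lower(r.num_range) as low_range,
       upper(r.num_range) - 1 as high_range,
       c.name
from range_mapping_pg AS r
LEFT JOIN code_mapping AS c ON r.code = c.id
WHERE r.is_active
  and r.num_range @> 12345678912::bigint
ORDER BY r.id DESC
limit 1;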
The code in https://mysql.rjweb.org/doc.php/ipranges shows how to improve the lookup for IP addresses. It makes MySQL's performance as fast as Postgres (essentially one row fetched for a simple lookup). You have DECIMAL(11,0); that should work equally well. It does involve changing the structure to have only one number per row. If there are gaps, they are handled by extra dummy rows that have [presumably] is_active=false.
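A minimal sketch of that one-number-per-row layout, assuming a hypothetical table named range_start (names are illustrative; the linked article describes the full technique):

-- Each row stores only the start of a range; the next row's low_range implies
-- this range's end. Gaps are covered by dummy rows with is_active = 0.
create table range_start (
`low_range` decimal(11,0) NOT NULL,
`is_active` tinyint(1) NOT NULL DEFAULT 1,
`code` int(8) DEFAULT NULL,
PRIMARY KEY (`low_range`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

-- The lookup becomes a single-row primary-key fetch: the candidate is the row
-- with the largest low_range not exceeding the requested number.
select r.low_range, r.is_active, c.name
from range_start AS r
LEFT JOIN code_mapping AS c ON r.code = c.id
WHERE r.low_range <= 12345678912
ORDER BY r.low_range DESC
limit 1;
-- If the returned row has is_active = 0, the number falls in a gap.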