Problem Statement: We receive a request containing an 11-digit number and need to efficiently look it up in the database and return a single row (the last updated one) based on the range in which the number falls.
I have already used MySQL and did load testing with proper indexes, but the lookup is still taking too long.
Current DB Structure:
Using MySQL
Currently, we have a table with two columns, low_range and high_range, which store the range boundaries, and two further columns that store the corresponding data: is_active (values 0 or 1) and code (an int value that is the id of another table, code_mapping).
Table 1 name: range_mapping
DB schema:
create table range_mapping (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`low_range` decimal(11,0) NOT NULL,
`high_range` decimal(11,0) NOT NULL,
`is_active` tinyint(1) NOT NULL DEFAULT 1,
`code` int(8) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_comp_is_active_low_high_range` (`is_active`, `low_range`, `high_range`)
) ENGINE=InnoDB AUTO_INCREMENT=26891234 DEFAULT CHARSET=utf8
Table 2 name: code_mapping
DB schema:
create table code_mapping (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(100) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `nameIdx` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=4410 DEFAULT CHARSET=utf8
Query to be optimized: I need to redesign or optimize the following query for faster execution.
select a.low_range, a.high_range, b.name from range_mapping AS a
LEFT JOIN code_mapping AS b ON a.code = b.id
WHERE a.is_active = 1 and 12345678912 BETWEEN a.low_range AND a.high_range
ORDER BY a.id DESC
limit 1;
Explain query:
explain select a.low_range, a.high_range, b.name
from range_mapping AS a force index(idx_comp_is_active_low_high_range)
LEFT JOIN code_mapping AS b ON a.code = b.id
WHERE a.is_active = 1 and 12345678912 BETWEEN a.low_range AND a.high_range
ORDER BY a.id DESC limit 1;
Output:
id: 1
select_type: SIMPLE
table: a
type: range
possible_keys: idx_comp_is_active_low_high_range
key: idx_comp_is_active_low_high_range
key_len: 11
ref: NULL
rows: 227190
Extra: Using index condition; Using filesort
***************************
id: 1
select_type: SIMPLE
table: b
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: testbackup.a.code
rows: 1
Extra: Using where
Explain Analyze of query:
-> Limit: 1 row(s)  (cost=65515 rows=0.104) (actual time=529..529 rows=1 loops=1)
    -> Nested loop left join  (cost=65515 rows=0.104) (actual time=529..529 rows=1 loops=1)
        -> Filter: ((a.is_active = 1) and (535387481 between a.low_range and a.high_range))  (cost=0.21 rows=0.104) (actual time=529..529 rows=1 loops=1)
            -> Index scan on a using PRIMARY (reverse)  (cost=0.21 rows=2) (actual time=1.25..490 rows=531614 loops=1)
        -> Filter: (a.code = b.id)  (cost=0.961 rows=1) (actual time=0.0347..0.0347 rows=1 loops=1)
            -> Single-row index lookup on b using PRIMARY (id=a.code)  (cost=0.961 rows=1) (actual time=0.0331..0.0331 rows=1 loops=1)
Request: 12345678912
Possible rows in table:
low_range: 12345678901 high_range: 12345678913
low_range: 12345678910 high_range: 12345678912
low_range: 12345678902 high_range: 12345678920
Question:
Which database can be used to store millions of range records with efficient lookups? Is there a better approach with MySQL, if possible?
The expectation is that the lookup should be fast and efficient, roughly 500 ms.
2 Answers
PostgreSQL has built-in range types, which offer a @> containment operator supported by GiST and SP-GiST indexes. Thing is, the performance of index lookups is driven by the distribution of your samples. If I purposefully move all filler samples out of the target range, I can pick the targets out of 50M rows in 0.06 ms: demo1 (these demos are scaled-down examples of tests I did). It's the same if I generate them fully randomly but tighten the range widths down to 100 (the range widths in your example are 12, 2, 18): at a scale of 50M, with numbers this high, overlaps are unlikely, selectivity is high, and the index scan is rapid. If I bump the range width limit up to 100k, generating more hits in the target range, it still takes just 0.16 ms, and 4.6 ms if I go further up to 10M with even more matches. demo2
If I spawn 50M completely random ranges, most of them span so wide that they are unlikely not to overlap and catch the target, and the lookup jumps to 7 s because it goes through nearly half the heap; the index isn't really helpful in a case like that.
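A minimal sketch of that setup, assuming a hypothetical table named range_mapping_pg (the column, index name, and code value 42 are illustrative, not taken from the demos above):

create table range_mapping_pg (
  id        bigserial PRIMARY KEY,
  num_range int8range NOT NULL,   -- low_range..high_range stored as a single range value
  is_active boolean   NOT NULL DEFAULT true,
  code      integer
);

-- GiST index on the range column; the @> containment operator can use it
create index idx_range_mapping_pg_range on range_mapping_pg using gist (num_range);

-- '[]' makes both bounds inclusive; PostgreSQL canonicalizes discrete ranges
-- to '[)' internally, so upper(num_range) returns high_range + 1
insert into range_mapping_pg (num_range, is_active, code)
values (int8range(12345678901, 12345678913, '[]'), true, 42);

-- Lookup: latest active row whose range contains the requested number
select lower(r.num_range) as low_range,
       upper(r.num_range) - 1 as high_range,
       c.name
from range_mapping_pg AS r
LEFT JOIN code_mapping AS c ON r.code = c.id
WHERE r.is_active
  and r.num_range @> 12345678912::bigint
ORDER BY r.id DESC
limit 1;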
The code in https://mysql.rjweb.org/doc.php/ipranges shows how to improve the lookup for IP addresses. It makes MySQL's performance as fast as Postgres (essentially one row fetched for a simple lookup). You have DECIMAL(11,0); that should work equally well. It does involve changing the structure to have only one number per row. If there are gaps, they are handled by extra dummy rows that have [presumably] is_active=false.
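A minimal sketch of that one-number-per-row layout, assuming a hypothetical table named range_start (names are illustrative; the linked article describes the full technique):

-- Each row stores only the start of a range; the next row's low_range implies
-- this range's end. Gaps are covered by dummy rows with is_active = 0.
create table range_start (
`low_range` decimal(11,0) NOT NULL,
`is_active` tinyint(1) NOT NULL DEFAULT 1,
`code` int(8) DEFAULT NULL,
PRIMARY KEY (`low_range`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

-- The lookup becomes a single-row primary-key fetch: the candidate is the row
-- with the largest low_range not exceeding the requested number.
select r.low_range, r.is_active, c.name
from range_start AS r
LEFT JOIN code_mapping AS c ON r.code = c.id
WHERE r.low_range <= 12345678912
ORDER BY r.low_range DESC
limit 1;
-- If the returned row has is_active = 0, the number falls in a gap.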