Consider an example table of items with three columns:

- id (UUID, primary key)
- time (TIMESTAMPTZ)
- store_id (UUID, foreign key)

There is a covering B-tree index on (store_id, time) INCLUDE (id).
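A minimal sketch of the schema and index; the index name is taken from the query plans below, and the foreign-key target is an assumption:

CREATE TABLE items (
    id       uuid        PRIMARY KEY,
    time     timestamptz NOT NULL,
    store_id uuid        NOT NULL  -- foreign key to a stores table (target assumed)
);

-- Covering index: equality lookups on store_id, ordered by time,
-- with id included so the queries below can use index-only scans.
CREATE INDEX items_store_id_time_covering_id_idx
    ON items (store_id, time) INCLUDE (id);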
Finding the top-N items
I’d like to write a query that efficiently gets the 20 newest items across a specified set of stores. There may be tens or hundreds of thousands of rows per store_id. This is a relatively straightforward query that produces the correct result:
SELECT id, time
FROM items
WHERE store_id = ANY(ARRAY[
'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa',
'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb',
...
]::uuid[])
ORDER BY time DESC
LIMIT 20;
However, according to the query plan from EXPLAIN (ANALYZE, BUFFERS), Postgres reads over ten thousand buffer pages despite the LIMIT 20 clause. Even with an SSD, this is slow. What appears to be happening is that Postgres reads all the matching index entries and only then sorts them.
Query plan
This query plan was generated after this question was first posted and I had performed some other database optimizations like reindexing and vacuuming. The query runs reasonably quickly now but still accesses over 7900 buffer pages.
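The plan below was captured with something like the following; the exact EXPLAIN options are an assumption (the Output, Settings, and Query Identifier lines suggest VERBOSE and SETTINGS were also enabled):

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS)
SELECT id, time
FROM items
WHERE store_id = ANY(ARRAY[
    'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa',
    'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb'
]::uuid[])
ORDER BY time DESC
LIMIT 20;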
Limit (cost=553.97..554.02 rows=20 width=56) (actual time=214.618..214.624 rows=21 loops=1)
Output: id, time
Buffers: shared hit=7606 read=369 dirtied=1
I/O Timings: read=205.188
-> Sort (cost=553.97..574.15 rows=8073 width=56) (actual time=214.617..214.620 rows=20 loops=1)
Output: id, time
Sort Key: items.time DESC
Sort Method: top-N heapsort Memory: 28kB
Buffers: shared hit=7606 read=369 dirtied=1
I/O Timings: read=205.188
-> Index Only Scan using items_store_id_time_covering_id_idx on public.items (cost=0.43..336.31 rows=8073 width=56) (actual time=0.730..211.783 rows=11920 loops=1)
Output: id, time
Index Cond: (items.store_id = ANY ('{aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb, ...}'::uuid[]))
Heap Fetches: 282
Buffers: shared hit=7606 read=369 dirtied=1
I/O Timings: read=205.188
Settings: effective_cache_size = '3052336kB', random_page_cost = '1.1'
Query Identifier: -1648601102884428975
Planning Time: 0.160 ms
Execution Time: 214.647 ms
Lateral joins – 50-100x faster
A workaround that helps immensely is to use a lateral join to iteratively load only the 20 newest items per store_id and then take the newest 20 of them all. This loads just a couple hundred buffer pages and is about 50-100x more efficient for my test workload!
SELECT id, time
FROM unnest(ARRAY[
'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa',
'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb',
...
]::uuid[]) AS store_ids(store_id)
JOIN LATERAL (
SELECT id, time
FROM items
WHERE store_id = store_ids.store_id
ORDER BY time DESC
LIMIT 20
) AS latest ON TRUE
ORDER BY time DESC
LIMIT 20;
Query plan
This query plan is much more efficient, accessing 119 buffer pages.
Limit (cost=26.21..26.26 rows=20 width=24) (actual time=0.284..0.287 rows=20 loops=1)
Output: items.id, items.time
Buffers: shared hit=119
-> Sort (cost=26.21..26.76 rows=220 width=24) (actual time=0.283..0.285 rows=20 loops=1)
Output: items.id, items.time
Sort Key: items.time DESC
Sort Method: top-N heapsort Memory: 27kB
Buffers: shared hit=119
-> Nested Loop (cost=0.43..20.35 rows=220 width=24) (actual time=0.055..0.247 rows=200 loops=1)
Output: items.id, items.time
Buffers: shared hit=119
-> Function Scan on pg_catalog.unnest store_ids (cost=0.00..0.11 rows=11 width=16) (actual time=0.005..0.006 rows=11 loops=1)
Output: store_ids.store_id
Function Call: unnest('{aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb,...}'::uuid[])
-> Limit (cost=0.43..1.44 rows=20 width=24) (actual time=0.011..0.019 rows=18 loops=11)
Output: items.id, items.time
Buffers: shared hit=119
-> Index Only Scan Backward using items_store_id_time_covering_id_idx on public.items (cost=0.43..5.48 rows=100 width=24) (actual time=0.010..0.017 rows=18 loops=11)
Output: items.id, items.time
Index Cond: (items.store_id = store_ids.store_id)
Heap Fetches: 20
Buffers: shared hit=119
Settings: effective_cache_size = '3052336kB', random_page_cost = '1.1'
Query Identifier: -8987254562252190725
Planning:
Buffers: shared hit=28
Planning Time: 0.254 ms
Execution Time: 0.321 ms
However, an even smarter query plan would be to load just 20 items from each store_id in sequence, keeping only the 20 items with the latest time values overall. I’d also prefer declarative SQL over imperative SQL (specifically, the for-each nature of lateral joins) to give the query planner more control, ideally for better performance.
Other attempts
I’ve also tried two alternative approaches, using ROW_NUMBER() and a simulated loose index scan, both of which performed poorly in this scenario, with Postgres still reading over ten thousand buffer pages.
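For reference, the ROW_NUMBER() approach looked roughly like this (a sketch, not the exact query I ran):

SELECT id, time
FROM (
    SELECT id, time,
           ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY time DESC) AS rn
    FROM items
    WHERE store_id = ANY(ARRAY[
        'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa',
        'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb'
    ]::uuid[])
) ranked
WHERE rn <= 20
ORDER BY time DESC
LIMIT 20;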
Is there a (simple) way to get Postgres to generate a query plan that loads and sorts a minimal number of rows? Are the query planner and executor even capable of the "smarter query plan" described above?
Answers
Your method is probably the best way to go about it. A better plan would probably need something like an "index skip scan", which is not implemented in PostgreSQL. A patch has been proposed and went through 19 commitfests, but sadly it never landed, even though everybody expressed interest.
I can’t tell if you actually need it to be faster, or if you are just offended by imperfection. In the first case, you should share the plans with us so we could possibly recommend pragmatic improvements. (And in the second case, prepare to be disappointed).
You can get even better than your proposed solution by reading not 20 rows for each store_id, but only one more row than is actually returned from each store_id. But getting that plan would require the query to be even worse than "imperative": it would need to be dynamically constructed.
Perhaps a more pragmatic solution would be to just add an index on (time, store_id, id) and then use your original query. However, this will do very badly if the array of store_ids in the query consists entirely of stores with no new items, only old ones. Even worse, PostgreSQL will not be able to detect that this is the case and switch to a different plan.
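A sketch of that index (the name is an assumption):

-- Ordered by time first, so the original query can walk the index backwards
-- in time order and stop once 20 rows matching the store_id filter are found.
CREATE INDEX items_time_store_id_id_idx ON items (time, store_id, id);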