
Consider an example table of items with three columns:

  • id (UUID, primary key)
  • time (TIMESTAMPTZ)
  • store_id (UUID, foreign key)

There is a covering B-tree index on (store_id, time) INCLUDE (id).
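
For reference, a minimal sketch of that schema (the stores table and the NOT NULL constraints are assumptions on my part, not taken from the question; the index name is the one that appears in the plans below):

CREATE TABLE items (
  id       uuid PRIMARY KEY,
  time     timestamptz NOT NULL,                   -- NOT NULL is assumed
  store_id uuid NOT NULL REFERENCES stores (id)    -- the "stores" table is assumed
);

CREATE INDEX items_store_id_time_covering_id_idx
  ON items (store_id, time) INCLUDE (id);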

Finding the top-N items

I’d like to write a query that efficiently gets the 20 newest items across a specified set of stores. There may be tens or hundreds of thousands of rows per store_id. This is a relatively straightforward query that produces the correct result:

SELECT id, time
FROM items
WHERE store_id = ANY(ARRAY[
  'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa',
  'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb',
  ...
]::uuid[])
ORDER BY time DESC
LIMIT 20;

However, according to the query plan from EXPLAIN (ANALYZE, BUFFERS), Postgres reads over ten thousand buffer pages despite the LIMIT 20 clause. Even with an SSD, this is slow. What appears to be happening is that Postgres reads all of the matching index entries before sorting them.

Query plan

This query plan was generated after the question was first posted, once I had performed some other database optimizations such as reindexing and vacuuming. The query now runs reasonably quickly but still accesses over 7900 buffer pages.

 Limit  (cost=553.97..554.02 rows=20 width=56) (actual time=214.618..214.624 rows=21 loops=1)
   Output: id, time
   Buffers: shared hit=7606 read=369 dirtied=1
   I/O Timings: read=205.188
   ->  Sort  (cost=553.97..574.15 rows=8073 width=56) (actual time=214.617..214.620 rows=20 loops=1)
         Output: id, time
         Sort Key: items.time DESC
         Sort Method: top-N heapsort  Memory: 28kB
         Buffers: shared hit=7606 read=369 dirtied=1
         I/O Timings: read=205.188
         ->  Index Only Scan using items_store_id_time_covering_id_idx on public.items  (cost=0.43..336.31 rows=8073 width=56) (actual time=0.730..211.783 rows=11920 loops=1)
               Output: id, time
               Index Cond: (items.store_id = ANY ('{aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb, ...}'::uuid[]))
               Heap Fetches: 282
               Buffers: shared hit=7606 read=369 dirtied=1
               I/O Timings: read=205.188
 Settings: effective_cache_size = '3052336kB', random_page_cost = '1.1'
 Query Identifier: -1648601102884428975
 Planning Time: 0.160 ms
 Execution Time: 214.647 ms

Lateral joins – 50-100x faster

A workaround that helps immensely is to use a lateral join to iteratively load only the 20 newest items per store_id and then take the newest 20 of them all. This loads just a couple hundred buffer pages and is about 50-100x more efficient for my test workload!

SELECT id, time
FROM unnest(ARRAY[
  'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa',
  'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb',
  ...
  ]::uuid[]) AS store_ids(store_id)
JOIN LATERAL (
  SELECT id, time
  FROM items
  WHERE store_id = store_ids.store_id
  ORDER BY time DESC
  LIMIT 20
) AS per_store ON TRUE
ORDER BY time DESC
LIMIT 20;

Query plan

This query plan is much more efficient, accessing 119 buffer pages.

Limit  (cost=26.21..26.26 rows=20 width=24) (actual time=0.284..0.287 rows=20 loops=1)
   Output: items.id, items.time
   Buffers: shared hit=119
   ->  Sort  (cost=26.21..26.76 rows=220 width=24) (actual time=0.283..0.285 rows=20 loops=1)
         Output: items.id, items.time
         Sort Key: items.time DESC
         Sort Method: top-N heapsort  Memory: 27kB
         Buffers: shared hit=119
         ->  Nested Loop  (cost=0.43..20.35 rows=220 width=24) (actual time=0.055..0.247 rows=200 loops=1)
               Output: items.id, items.time
               Buffers: shared hit=119
               ->  Function Scan on pg_catalog.unnest store_ids  (cost=0.00..0.11 rows=11 width=16) (actual time=0.005..0.006 rows=11 loops=1)
                     Output: store_ids.store_id
                     Function Call: unnest('{aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa,bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb,...}'::uuid[])
               ->  Limit  (cost=0.43..1.44 rows=20 width=24) (actual time=0.011..0.019 rows=18 loops=11)
                     Output: items.id, items.time
                     Buffers: shared hit=119
                     ->  Index Only Scan Backward using items_store_id_time_covering_id_idx on public.items  (cost=0.43..5.48 rows=100 width=24) (actual time=0.010..0.017 rows=18 loops=11)
                           Output: items.id, items.time
                           Index Cond: (items.store_id = store_ids.store_id)
                           Heap Fetches: 20
                           Buffers: shared hit=119
 Settings: effective_cache_size = '3052336kB', random_page_cost = '1.1'
 Query Identifier: -8987254562252190725
 Planning:
   Buffers: shared hit=28
 Planning Time: 0.254 ms
 Execution Time: 0.321 ms

However, an even smarter query plan would be to load just 20 items from each store_id in sequence, keeping only the 20 items with the latest time values. I’d also prefer declarative SQL over imperative SQL (specifically, the for-each nature of lateral joins) to give the query planner more control, ideally for better performance.

Other attempts

I’ve also tried two alternative approaches, using ROW_NUMBER() and a simulated loose index scan, both of which performed poorly in this scenario: Postgres still read over ten thousand buffer pages.
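
For illustration, the ROW_NUMBER() variant looks roughly like this (a reconstruction of the approach, not necessarily the exact query that was tested):

SELECT id, time
FROM (
  SELECT id, time,
         ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY time DESC) AS rn
  FROM items
  WHERE store_id = ANY(ARRAY[
    'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa',
    'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb',
    ...
  ]::uuid[])
) ranked
WHERE rn <= 20
ORDER BY time DESC
LIMIT 20;

The window function still has to read and rank every matching index entry before the rn <= 20 filter applies, which is consistent with the high buffer counts mentioned above.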

Is there a (simple) way to get Postgres to generate a query plan that loads and sorts a minimal number of rows? Are the query planner and executor even capable of the "smarter query plan" described above?

2 Answers


  1. Your method is probably the best way to go about it. A better plan would probably need something like an "index skip scan", which is not implemented in PostgreSQL. A patch was proposed and went through 19 commitfests, but sadly it never landed, even though everybody expressed interest.

  2. I can’t tell if you actually need it to be faster, or if you are just offended by imperfection. In the first case, you should share the plans with us so we could possibly recommend pragmatic improvements. (And in the second case, prepare to be disappointed).

    You can do even better than your proposed solution by reading not 20 rows per store_id, but only one more row than is actually returned for each store_id. But getting that plan would require the query to be even worse than "imperative": it would need to be dynamically constructed.

    (SELECT id, time FROM items WHERE store_id = 'a1...' ORDER BY time DESC) 
    UNION ALL
    (SELECT id, time FROM items WHERE store_id = 'b7...' ORDER BY time DESC)
    -- etc, 
    ORDER BY time DESC LIMIT 20;
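
    For completeness, one way to build such a statement dynamically is a small PL/pgSQL helper along these lines (the function name and signature are hypothetical, not part of the answer):

    -- Hypothetical helper that assembles one ordered branch per store id
    -- and runs the combined UNION ALL query with a global ORDER BY/LIMIT.
    -- (An empty p_store_ids array is not handled in this sketch.)
    CREATE FUNCTION newest_items(p_store_ids uuid[], p_limit int)
    RETURNS TABLE (id uuid, "time" timestamptz)
    LANGUAGE plpgsql AS $$
    DECLARE
      branches text;
    BEGIN
      SELECT string_agg(
               format('(SELECT id, time FROM items WHERE store_id = %L ORDER BY time DESC)', s),
               ' UNION ALL ')
        INTO branches
        FROM unnest(p_store_ids) AS s;

      RETURN QUERY EXECUTE
        branches || format(' ORDER BY time DESC LIMIT %s', p_limit);
    END;
    $$;

    Because the statement is planned when it is executed, PostgreSQL can use a Merge Append over the ordered branches and stop as soon as the requested number of rows has been produced.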
    

    Perhaps a more pragmatic solution would be to just add an index on (time, store_id, id) and then use your original query. However, this will do very badly if the array of store_ids in the query consists entirely of stores with no new items, only old ones. Even worse, PostgreSQL will not be able to detect that this is the case, and so it cannot switch to a different plan.
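
    For concreteness, the suggested index would look something like this (the index name is mine, not from the answer):

    CREATE INDEX items_time_store_id_id_idx
        ON items (time, store_id, id);

    With that index, the original query can walk the index backward by time and stop as soon as it has found 20 rows belonging to the listed stores.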
