
I have a table (on an RDS Postgres 15.4 instance, db.m7g.large):

CREATE TABLE MyTable (
    content_id integer, 
    part integer, 
    vector "char"[]
);

There is a B-Tree index on content_id. My data consists of 100M rows: there are 1M (0 .. 10^6-1) distinct values of content_id, and for each value of content_id there are 100 (0..99) values of part. The column vector contains an array of 384 byte-sized numbers when content_id is divisible by 100, and is NULL otherwise.
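
For reference, a sketch of how data with this shape can be generated (the filler byte value 1 is arbitrary; the index name is the one that appears in the plans below):

-- 1M content_id values x 100 parts = 100M rows;
-- vector is non-NULL only for every 100th content_id.
INSERT INTO MyTable (content_id, part, vector)
SELECT c, p,
       CASE WHEN c % 100 = 0 THEN array_fill(1::"char", ARRAY[384]) ELSE NULL END
FROM generate_series(0, 999999) AS c,
     generate_series(0, 99) AS p;

CREATE INDEX idx_content_on_mytable ON MyTable (content_id);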

I have constructed this artificial data to test the performance of the following query, submitted from a Python script (it will become clear in a moment why I left it in Python for the question):

query = f"""
WITH 
    Vars(key) as (
        VALUES (array_fill(1, ARRAY[{384}])::vector)
    ),
    Projection as (
        SELECT *
        FROM MyTable P
        WHERE P.content_id in ({str(list(range(0, 999999, 100)))[1:-1]})
    )
SELECT P.content_id, P.part
FROM Projection P, Vars
ORDER BY P.vector::int[]::vector <#> key
LIMIT {10};
"""

<#> is pgvector's inner-product distance operator: it returns the negative dot product, so the ascending ORDER BY below ranks rows by descending dot product. vector is the type defined by that extension, which to my understanding is similar to real[].
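
For example, the sign convention can be seen with:

SELECT '[1,2,3]'::vector <#> '[4,5,6]'::vector;   -- returns -32, i.e. minus the dot product 32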

Note that the WHERE clause specifies an explicit list of 10K values of content_id (which correspond to 1M rows, i.e. every hundredth row in the table). Because of this large explicit list, I have to leave my query in Python and cannot run EXPLAIN ANALYZE.

The above query takes ~6 seconds to execute.

However, when I prepend this query with SET enable_seqscan = off; the query takes only ~3 seconds.

Question 1: Given that we need every 100th row, and that much of the computation is spent computing the dot products and ordering by them, how come sequential scans are not better than using the index? (Even more puzzling, I can't understand how using the index could result in an improvement by a factor of 2.)

Question 2: How come this improvement disappears if I replace the explicit list of values with generate_series, as shown below?

WHERE content_id IN (SELECT generate_series(0, 999999, 100))

Now, for this latter query I do have EXPLAIN ANALYZE output:

 Limit  (cost=1591694.63..1591695.46 rows=10 width=24) (actual time=6169.118..6169.125 rows=10 loops=1)
   ->  Result  (cost=1591694.63..2731827.31 rows=13819790 width=24) (actual time=6169.117..6169.122 rows=10 loops=1)
         ->  Sort  (cost=1591694.63..1626244.11 rows=13819790 width=424) (actual time=6169.114..6169.117 rows=10 loops=1)
               Sort Key: ((((p.vector)::integer[])::vector <#> '[1,1,...,1]'::vector))
               Sort Method: top-N heapsort  Memory: 34kB
               ->  Nested Loop  (cost=194.30..1293053.94 rows=13819790 width=424) (actual time=2.629..6025.693 rows=1000000 loops=1)
                     ->  HashAggregate  (cost=175.02..177.02 rows=200 width=4) (actual time=2.588..5.321 rows=10000 loops=1)
                           Group Key: generate_series(0, 999999, 100)
                           Batches: 1  Memory Usage: 929kB
                           ->  ProjectSet  (cost=0.00..50.02 rows=10000 width=4) (actual time=0.002..0.674 rows=10000 loops=1)
                                 ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.001 rows=1 loops=1)
                     ->  Bitmap Heap Scan on mytable p  (cost=19.28..4204.85 rows=1382 width=416) (actual time=0.007..0.020 rows=100 loops=10000)
                           Recheck Cond: (content_id = (generate_series(0, 999999, 100)))
                           Heap Blocks: exact=64444
                           ->  Bitmap Index Scan on idx_content_on_mytable  (cost=0.00..18.93 rows=1382 width=0) (actual time=0.005..0.005 rows=100 loops=10000)
                                 Index Cond: (content_id = (generate_series(0, 999999, 100)))
 Planning Time: 0.213 ms
 Execution Time: 6169.260 ms
(18 rows)

UPDATE: @jjanes commented regarding my first question:

Assuming your data is clustered on content_id, you need 100 consecutive rows out of every set of 10,000 rows. That is very different than needing every 100th row.

If I understand correctly, this means that each of the 10K index look-ups lands on a contiguous range of heap pages rather than on 100 scattered rows, and that range can then be scanned sequentially.
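
Whether the table really is laid out in content_id order can be checked from the planner statistics; for example (assuming ANALYZE has been run on the table):

SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'mytable' AND attname = 'content_id';
-- correlation near 1.0 (or -1.0) means the heap order closely follows content_id,
-- so the 100 rows per content_id sit on a few adjacent pages.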

Following are the outputs of EXPLAIN (ANALYZE, BUFFERS) for all three queries:

  1. The original query:
 Limit  (cost=1430170.64..1430171.81 rows=10 width=16) (actual time=6300.232..6300.394 rows=10 loops=1)
   Buffers: shared hit=55868 read=436879
   I/O Timings: shared/local read=1027.617
   ->  Gather Merge  (cost=1430170.64..2773605.03 rows=11514348 width=16) (actual time=6300.230..6300.391 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=55868 read=436879
         I/O Timings: shared/local read=1027.617
         ->  Sort  (cost=1429170.62..1443563.55 rows=5757174 width=16) (actual time=6291.083..6291.085 rows=8 loops=3)
               Sort Key: ((((p.vector)::integer[])::vector <#> '[1,1,...,1]'::vector))
               Sort Method: top-N heapsort  Memory: 25kB
               Buffers: shared hit=55868 read=436879
               I/O Timings: shared/local read=1027.617
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
               ->  Parallel Seq Scan on mytable p  (cost=25.00..1304760.16 rows=5757174 width=16) (actual time=1913.156..6237.441 rows=333333 loops=3)
                     Filter: (content_id = ANY ('{0,100,...,999900}'::integer[]))
                     Rows Removed by Filter: 33000000
                     Buffers: shared hit=55754 read=436879
                     I/O Timings: shared/local read=1027.617
 Planning:
   Buffers: shared hit=149
 Planning Time: 8.444 ms
 Execution Time: 6300.452 ms
(24 rows)
  2. The query with SET enable_seqscan = off;
 Limit  (cost=1578577.14..1578578.31 rows=10 width=16) (actual time=3121.539..3123.430 rows=10 loops=1)
   Buffers: shared hit=95578
   ->  Gather Merge  (cost=1578577.14..2922011.54 rows=11514348 width=16) (actual time=3121.537..3123.426 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=95578
         ->  Sort  (cost=1577577.12..1591970.05 rows=5757174 width=16) (actual time=3108.995..3108.997 rows=9 loops=3)
               Sort Key: ((((p.vector)::integer[])::vector <#> '[1,1,...,1]'::vector))
               Sort Method: top-N heapsort  Memory: 25kB
               Buffers: shared hit=95578
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
               ->  Parallel Bitmap Heap Scan on mytable p  (cost=184260.30..1453166.66 rows=5757174 width=16) (actual time=42.277..3057.887 rows=333333 loops=3)
                     Recheck Cond: (content_id = ANY ('{0,100,...,999900}'::integer[]))
                           Buffers: shared hit=40000
 Planning:
   Buffers: shared hit=149
 Planning Time: 8.591 ms
 Execution Time: 3123.638 ms
(23 rows)
  3. Like 2, but with generate_series:
 Limit  (cost=1591694.63..1591694.66 rows=10 width=16) (actual time=6155.109..6155.114 rows=10 loops=1)
   Buffers: shared hit=104447
   ->  Sort  (cost=1591694.63..1626244.11 rows=13819790 width=16) (actual time=6155.107..6155.111 rows=10 loops=1)
         Sort Key: ((((p.vector)::integer[])::vector <#> '[1,1,...,1]'::vector))
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=104447
         ->  Nested Loop  (cost=194.30..1293053.94 rows=13819790 width=16) (actual time=2.912..6034.798 rows=1000000 loops=1)
               Buffers: shared hit=104444
               ->  HashAggregate  (cost=175.02..177.02 rows=200 width=4) (actual time=2.870..5.484 rows=10000 loops=1)
                     Group Key: generate_series(0, 999999, 100)
                     Batches: 1  Memory Usage: 929kB
                     ->  ProjectSet  (cost=0.00..50.02 rows=10000 width=4) (actual time=0.002..0.736 rows=10000 loops=1)
                           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.001 rows=1 loops=1)
               ->  Bitmap Heap Scan on mytable p  (cost=19.28..4204.85 rows=1382 width=416) (actual time=0.007..0.020 rows=100 loops=10000)
                     Recheck Cond: (content_id = (generate_series(0, 999999, 100)))
                     Heap Blocks: exact=64444
                     Buffers: shared hit=104444
                     ->  Bitmap Index Scan on idx_content_on_mytable  (cost=0.00..18.93 rows=1382 width=0) (actual time=0.005..0.005 rows=100 loops=10000)
                           Index Cond: (content_id = (generate_series(0, 999999, 100)))
                           Buffers: shared hit=40000
 Planning:
   Buffers: shared hit=180
 Planning Time: 1.012 ms
 Execution Time: 6155.251 ms
(24 rows)

2 Answers


  1. Chosen as BEST ANSWER

    This is just a summary (a short, direct answer to my two questions as I understood it), for my own and others' use:

    1. Each of the 10K index look-ups lands on a contiguous range of heap pages rather than on 100 scattered rows. That range can then be scanned sequentially.

    2. Using generate_series instead of an explicit list loses the parallel plan.


  2. Given that we need every 100th row, and that much of the computation is spent computing the dot products and ordering by them, how come sequential scans are not better than using the index?

    The dot product computation only happens after the records have been retrieved (regardless of the plan used to retrieve them). On my Google Cloud instance calculating dot products of 10k pairs of vectors and ordering by them takes 46 ms, so it’s not what’s adding the most overhead.

    A sequential scan would need to evaluate every single record in the table and make sure it matches your WHERE condition.

    An index scan would traverse your index first (which is much smaller than the heap) and retrieve only the heap pages it actually needs. This incurs an additional cost of doing heap lookups in a loop (possibly two or more times per page, if several matching records end up on the same page), which might or might not be offset by the fact that fewer heap page reads are required.

    The bitmap variety of the index scan tries to be smarter about the order in which it retrieves the heap pages (it first retrieves all pointers from the index, stores them in a bitmap, then retrieves the pages in heap order, skipping the ones it doesn't need). The additional cost is maintaining the bitmap; the win is that each heap page needs to be retrieved only once.
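
    For comparison, each access path can be forced per session with the planner GUCs and the BUFFERS numbers compared; a rough sketch with an abbreviated IN list (illustrative only, not the full 10K-value query):

    SET enable_seqscan = off;
    SET enable_bitmapscan = off;   -- leave this on to get the bitmap variant instead
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT P.content_id, P.part
    FROM MyTable P
    WHERE P.content_id IN (0, 100, 200)                -- abbreviated list
    ORDER BY P.vector::int[]::vector <#> array_fill(1, ARRAY[384])::vector
    LIMIT 10;
    RESET enable_seqscan;
    RESET enable_bitmapscan;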

    Your second query is also doing a redundant HashAggregate (to make sure the values coming out of GENERATE_SERIES are unique), but it’s not much of a problem.

    In your plans:

    1. Seq Scan: Buffers: shared hit=55754 read=436879, parallel with 2 workers
    2. Bitmap Index Scan: Buffers: shared hit=95578 (of which 40000 are index), also parallel with 2 workers
    3. GENERATE_SERIES: Buffers: shared hit=104444 (of which 40000 are index), sequential

    The seq scan needs to do about five times as many page reads as the bitmap scan (55754 hit + 436879 read, roughly 493k pages, vs. 95578), and most of those pages are not cached in the shared buffers.

    The bitmap index scan needs to read the index first and build a bitmap of the tuples that match the condition. If that bitmap no longer fits in memory, it degrades to a bitmap of pages (a lossy bitmap) and has to re-check every tuple on the pages it fetches. The bitmap does require some resources to build, but not much compared to the rest of the work the plan has to perform.
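
    The memory limit for that bitmap is work_mem; it can be inspected and, if the instance has RAM to spare, raised for the session so the bitmap stays exact:

    SHOW work_mem;
    SET work_mem = '256MB';   -- session-level setting; the value here is only an example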

    The one with GENERATE_SERIES runs single-threaded (no parallel workers), which roughly halves your overall performance (because at this point, everything is cached and your workload is purely CPU- and RAM-bound).

    In PostgreSQL, a nested loop is parallelized by distributing its outer side across the workers, and here the outer side is a function call, which cannot be distributed. You can replace the IN condition with a join to two unioned GENERATE_SERIES calls, one from 0 to 500k and one from 500k to 1M. This will likely produce a Parallel Append plan, which restores the parallelism and, as a side bonus, gets rid of the unnecessary aggregation.
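
    A rough, untested sketch of that rewrite (the 500k split point is arbitrary, and whether the planner actually chooses a Parallel Append depends on costs and settings):

    WITH Vars(key) AS (
        VALUES (array_fill(1, ARRAY[384])::vector)
    ),
    Ids(id) AS (
        SELECT generate_series(0, 499999, 100)
        UNION ALL
        SELECT generate_series(500000, 999999, 100)
    )
    SELECT P.content_id, P.part
    FROM MyTable P
    JOIN Ids ON P.content_id = Ids.id
    CROSS JOIN Vars
    ORDER BY P.vector::int[]::vector <#> key
    LIMIT 10;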

    You can also observe a slight increase in heap buffer reads compared to the enable_seqscan = off query: 104444 vs 95578. This is because this query builds a separate bitmap for each generate_series value, and some of the heap pages overlap (one page holds rows for several index keys), so they are fetched more than once.
