
I have a table store_record with 45 million records. I want to run a simple count query for the database_id value with the most rows. Note that I have an index on database_id.

SELECT COUNT(*) FROM store_record WHERE database_id='123';
-- returns ~17.2 million

The query took 3 minutes! See the query plan below, which I generated by prepending explain (analyze, buffers, verbose, settings) to the query:
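
explain (analyze, buffers, verbose, settings)
SELECT COUNT(*) FROM store_record WHERE database_id='123';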

Finalize Aggregate  (cost=3063219.25..3063219.25 rows=1 width=8) (actual time=178805.800..178899.302 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=174202 read=2786089
  I/O Timings: read=336637.165
  ->  Gather  (cost=3063219.15..3063219.25 rows=1 width=8) (actual time=178805.612..178899.288 rows=2 loops=1)
        Output: (PARTIAL count(*))
        Workers Planned: 1
        Workers Launched: 1
        JIT for worker 0:
          Functions: 4
"          Options: Inlining true, Optimization true, Expressions true, Deforming true"
"          Timing: Generation 0.688 ms, Inlining 68.060 ms, Optimization 20.002 ms, Emission 17.390 ms, Total 106.140 ms"
        Buffers: shared hit=174202 read=2786089
        I/O Timings: read=336637.165
        ->  Partial Aggregate  (cost=3062219.15..3062219.15 rows=1 width=8) (actual time=178781.061..178781.062 rows=1 loops=2)
              Output: PARTIAL count(*)
              Buffers: shared hit=174202 read=2786089
              I/O Timings: read=336637.165
              Worker 0: actual time=178756.791..178756.793 rows=1 loops=1
                Buffers: shared hit=86992 read=1397345
                I/O Timings: read=168337.781
              ->  Parallel Seq Scan on public.store_record  (cost=0.00..3056983.48 rows=10471335 width=0) (actual time=140.886..178023.778 rows=8784825 loops=2)
"                    Output: id, key, data, created_at, updated_at, database_id, organization_id, user_id"
                    Filter: (store_record.database_id = '7e28da88-ea52-451a-8611-eb9a60dbc15e'::uuid)
                    Rows Removed by Filter: 14472533
                    Buffers: shared hit=174202 read=2786089
                    I/O Timings: read=336637.165
                    Worker 0: actual time=110.506..177990.918 rows=8816662 loops=1
                      Buffers: shared hit=86992 read=1397345
                      I/O Timings: read=168337.781
"Settings: cpu_index_tuple_cost = '0.001', cpu_operator_cost = '0.0005', cpu_tuple_cost = '0.003', effective_cache_size = '10980000kB', max_parallel_workers_per_gather = '1', random_page_cost = '2', search_path = '""$user"", public, heroku_ext', work_mem = '100MB'"
Planning Time: 0.087 ms
JIT:
  Functions: 10
"  Options: Inlining true, Optimization true, Expressions true, Deforming true"
"  Timing: Generation 1.295 ms, Inlining 152.272 ms, Optimization 86.675 ms, Emission 36.935 ms, Total 277.177 ms"
Execution Time: 178900.033 ms

To test if postgres caching would help, I repeated the same query two more times and got the same result.

Then, just messing around, I ran VACUUM ANALYZE store_record, which took 15 minutes, and then repeated the same query as above. It took only 2.7 seconds, and the query plan looks quite different.

Finalize Aggregate  (cost=234344.55..234344.55 rows=1 width=8) (actual time=2538.619..2559.099 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=270505
  ->  Gather  (cost=234344.44..234344.55 rows=1 width=8) (actual time=2538.472..2559.087 rows=2 loops=1)
        Output: (PARTIAL count(*))
        Workers Planned: 1
        Workers Launched: 1
        JIT for worker 0:
          Functions: 3
"          Options: Inlining false, Optimization false, Expressions true, Deforming true"
"          Timing: Generation 0.499 ms, Inlining 0.000 ms, Optimization 0.193 ms, Emission 3.403 ms, Total 4.094 ms"
        Buffers: shared hit=270505
        ->  Partial Aggregate  (cost=233344.44..233344.45 rows=1 width=8) (actual time=2516.493..2516.494 rows=1 loops=2)
              Output: PARTIAL count(*)
              Buffers: shared hit=270505
              Worker 0: actual time=2494.746..2494.747 rows=1 loops=1
                Buffers: shared hit=131826
              ->  Parallel Index Only Scan using store_record_database_updated_at_a4646b_idx on public.store_record  (cost=0.11..228252.85 rows=10183195 width=0) (actual time=0.045..1749.091 rows=8637277 loops=2)
"                    Output: database_id, updated_at"
                    Index Cond: (store_record.database_id = '7e28da88-ea52-451a-8611-eb9a60dbc15e'::uuid)
                    Heap Fetches: 0
                    Buffers: shared hit=270505
                    Worker 0: actual time=0.068..1732.100 rows=8420237 loops=1
                      Buffers: shared hit=131826
"Settings: cpu_index_tuple_cost = '0.001', cpu_operator_cost = '0.0005', cpu_tuple_cost = '0.003', effective_cache_size = '10980000kB', max_parallel_workers_per_gather = '1', random_page_cost = '2', search_path = '""$user"", public, heroku_ext', work_mem = '100MB'"
Planning Time: 0.092 ms
JIT:
  Functions: 8
"  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"  Timing: Generation 0.981 ms, Inlining 0.000 ms, Optimization 0.326 ms, Emission 6.527 ms, Total 7.835 ms"
Execution Time: 2559.655 ms

The latter plan is clearly the better one: it uses an Index Only Scan instead of a Sequential Scan.

Some key notes:

  • I’m on postgres 12.14
  • The store_record table is read from frequently. New Relic shows ~700 queries per minute
  • The store_record table is written to frequently. New Relic shows ~350 queries per minute

A few questions:

  • Why would postgres use a sequential scan in the first case, instead of utilizing the index? That seems so incorrect.
  • Is the VACUUM ANALYZE responsible for the better plan / performance improvement?
  • If so, why did I have to manually run it? Why didn’t an autovacuum hit it?
  • Should I look into tuning autovacuum to run more regularly? Note that the following query showed it was last run about 20 days ago:
SELECT last_autovacuum FROM pg_stat_all_tables
WHERE schemaname = 'public' AND relname = 'store_record';


2 Answers


  1. Surprising as it looks at first glance, the speed-up is really mostly due to VACUUM.

    The reason lies in PostgreSQL’s implementation of multi-versioning: each table entry stores visibility information (in the form of xmin and xmax), because not every transaction can see every entry. This is to maintain a consistent view of the data in the face of concurrent data modifications, a property of the transaction system known as isolation.

    As your second execution plan shows, the fastest way to count rows is an index-only scan that doesn’t even read the table. But owing to the above, that is not always possible in PostgreSQL: you can only count rows that are visible to your transaction, and that visibility information is only stored in the table and is not available in the index (this is to avoid redundancy and, more importantly, to keep the index as small and, as a consequence, as efficient as possible).

    That sounds like an index-only scan should be impossible, but PostgreSQL has a way around that problem: each table has a visibility map, a tiny data structure that contains two bits per 8kB block of the table. One of these bits, the “all-visible bit”, indicates if all active transactions can see all entries in this table block or not. If the block containing a row is all-visible, an index scan can skip fetching the row from the table to test its visibility and can become a true index-only scan.

    Now that visibility map is maintained by the process that cleans up “garbage” in a table, and that process is VACUUM. So you can only get an efficient index-only scan if a table has been vacuumed recently. The query optimizer is aware of that and chooses the appropriate plan.
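
    If you want to see how much of the table is currently marked all-visible, pg_class keeps a counter that VACUUM maintains. Purely as an illustration, something like this shows what fraction of the table can be served by a true index-only scan:

    -- relallvisible counts the 8kB blocks whose all-visible bit is set;
    -- only rows in those blocks can skip the heap-visibility check
    SELECT relpages,
           relallvisible,
           round(100.0 * relallvisible / greatest(relpages, 1), 1) AS pct_all_visible
    FROM pg_class
    WHERE relname = 'store_record';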

    The upshot of all that is that if index-only scans are important for you, you should reduce autovacuum_vacuum_scale_factor for that table, so that it is vacuumed more frequently. On the other hand, a sensible database workload won’t count the rows in a large table frequently, so you might choose not to do that to conserve resources.
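
    As an illustration only (the 0.01 is a made-up value; pick something that matches how quickly the table churns), a per-table override looks like this:

    -- vacuum store_record after ~1% of its rows are dead instead of the default 20%
    ALTER TABLE store_record SET (autovacuum_vacuum_scale_factor = 0.01);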


    There are two other aspects that conspire to exacerbate the surprising performance difference:

    • the second query that you run will find more data cached, so it is typically faster

    • you have set random_page_cost twice as high as seq_page_cost, so the optimizer estimates that sequential I/O is twice as fast as random I/O
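
    Concretely, for that second point, you can confirm the current cost settings from pg_settings (seq_page_cost does not appear in the plan's Settings line, presumably because it is still at its default of 1):

    SELECT name, setting
    FROM pg_settings
    WHERE name IN ('seq_page_cost', 'random_page_cost');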

  2. As Laurenz indicates, it is almost certainly due to the VACUUM bringing the visibility map up to date, which makes the index-only scan both actually faster and estimated to be faster. Had the planner chosen an index-only scan while the visibility map was stale, it probably would have been even slower than the seq scan, which is why it did not choose it. (You could have proven that by forcing the index scan with enable_seqscan=off, but after vacuuming it is too late to do that experiment.)
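
    For reference, that experiment would have looked roughly like this (enable_seqscan only discourages sequential scans, and the SET is local to the session):

    SET enable_seqscan = off;   -- make the planner avoid seq scans if any alternative exists
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT COUNT(*) FROM store_record WHERE database_id = '123';
    RESET enable_seqscan;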

    You said there are ~350 write queries per minute, but it makes a big difference whether those are inserts or updates. Inserts do not create dead tuples (unless they roll back rather than commit) and so do not cause autovacuum to kick in (until v13, which introduced the autovacuum_vacuum_insert_* settings just for this purpose, but you are on v12). If those are updates, and each affects one row, then in 20 days there would be about 10e6 of them, which would just barely be enough to trigger an autovacuum on a 45e6-row table with the default settings. So it is likely an autovacuum was just about to happen, or was already underway, when you did the manual vacuum.
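
    A rough way to see how close the table is to the default trigger point (autovacuum_vacuum_threshold of 50 plus autovacuum_vacuum_scale_factor of 0.2 times the row count) is the standard statistics view; this is just a sketch using n_live_tup as an approximation of the row count:

    SELECT n_live_tup,
           n_dead_tup,
           last_autovacuum,
           autovacuum_count,
           -- approximate default trigger: 50 + 0.2 * live rows
           50 + 0.2 * n_live_tup AS approx_autovacuum_trigger
    FROM pg_stat_all_tables
    WHERE schemaname = 'public' AND relname = 'store_record';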
