
Hi, I have a database with 10 tables. Each table has ~0.5-1 billion rows and is partitioned by range and then by hash (10×10 = 100 partitions). The tables are newly created, with data imported from CSV files, and they are indexed on the column (id) used in the queries below. The database is hosted on an Azure PostgreSQL single server.
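
For reference, a minimal sketch of how one of the tables is laid out (the column list matches the \d+ output further down; the range bounds here are hypothetical, since the real ones are elided):

CREATE TABLE table_id4 (
    date   date NOT NULL,
    field1 real,
    field2 real,
    field3 smallint,
    id     integer NOT NULL
) PARTITION BY RANGE (id);

-- one of the 10 range partitions, itself split into 10 hash partitions
CREATE TABLE table_id4_r1 PARTITION OF table_id4
    FOR VALUES FROM (0) TO (100000)   -- hypothetical bound
    PARTITION BY HASH (id);
CREATE TABLE table_id4_r1_h1 PARTITION OF table_id4_r1
    FOR VALUES WITH (MODULUS 10, REMAINDER 0);
-- ... and so on for the other hash and range partitions ...

CREATE INDEX ON table_id4 (id);
CREATE INDEX ON table_id4 (date);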

A test query shows that most of the time is spent on "I/O Timings: read":

postgres=> EXPLAIN (ANALYZE, BUFFERS) select count(*) from table_id4 where id=244730;
                                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=7458.09..7458.10 rows=1 width=8) (actual time=21141.393..21141.396 rows=1 loops=1)
   Buffers: shared read=2256
   I/O Timings: read=21096.814
   ->  Append  (cost=41.26..7452.66 rows=2171 width=0) (actual time=197.168..21138.495 rows=2247 loops=1)
         Buffers: shared read=2256
         I/O Timings: read=21096.814
         ->  Bitmap Heap Scan on table_id4_r2_h5  (cost=41.26..7441.80 rows=2171 width=0) (actual time=197.167..21137.471 rows=2247 loops=1)
               Recheck Cond: (id = 244730)
               Heap Blocks: exact=2247
               Buffers: shared read=2256
               I/O Timings: read=21096.814
               ->  Bitmap Index Scan on table_id4_r2_h5_id_idx  (cost=0.00..40.72 rows=2171 width=0) (actual time=117.586..117.586 rows=2247 loops=1)
                     Index Cond: (id = 244730)
                     Buffers: shared read=9
                     I/O Timings: read=116.929
 Planning Time: 2.882 ms
 Execution Time: 21141.449 ms
(17 rows)

I also tried ANALYZE on the tables and then turning off bitmap scans, according to this post. The resulting plan (below) is an index-only scan, but it is still sometimes slow.
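
Roughly:

ANALYZE table_id2;
SET enable_bitmapscan = off;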

postgres=> EXPLAIN (ANALYZE, BUFFERS) select count(*) from table_id2 where id=179863;
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=5528.78..5528.79 rows=1 width=8) (actual time=24273.748..24273.749 rows=1 loops=1)
   Buffers: shared read=1943
   I/O Timings: read=24218.684
   ->  Append  (cost=0.42..5523.93 rows=1940 width=0) (actual time=83.735..24271.326 rows=1959 loops=1)
         Buffers: shared read=1943
         I/O Timings: read=24218.684
         ->  Index Only Scan using table_id2_r10_h5_id_idx on table_id2_r10_h5  (cost=0.42..5514.23 rows=1940 width=0) (actual time=83.734..24270.157 rows=1959 loops=1)
               Index Cond: (id = 179863)
               Heap Fetches: 1959
               Buffers: shared read=1943
               I/O Timings: read=24218.684
 Planning Time: 3.254 ms
 Execution Time: 24273.787 ms
(13 rows)
[UPDATE] To answer @jjanes's question on I/O speed:
This database is a "replication" of an existing Azure DB that can run the same queries on the same tables in <1 sec, as the ANALYZE in my last post showed. The difference I notice in the EXPLAIN results is that the old DB shows only Buffers: shared hit=4479 read=21, i.e. fewer reads and more hits.
This Azure DB has the same price tier as the old one: Single Server, Basic, both with "IOPS: Variable". I only upgraded the storage from 600 GB to 1 TB, so I think I/O speed may not be the issue?
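
To compare caching between the two DBs, I suppose one could also check the cumulative hit/read counters; a sketch (pg_statio_user_tables accumulates counters since the last stats reset):

SELECT relname, heap_blks_read, heap_blks_hit,
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;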

I did a batch test, which shows quite varied query times within the same loop:

DO $$
DECLARE r record; cnt bigint; t0 timestamptz;
BEGIN
  FOR r IN SELECT (random()*total_IDs)::int AS id FROM generate_series(1,10) LOOP  -- total_IDs: placeholder for the max id
    t0 := clock_timestamp();
    SELECT count(*) INTO cnt FROM table_id4 WHERE id = r.id;
    RAISE NOTICE '  id: % count#: %,   time: %s', r.id, cnt,
                 round(extract(epoch FROM clock_timestamp() - t0)::numeric, 6);
  END LOOP;
END $$;
NOTICE:   id: 321158 count#: 2154,   time: 46.734967s
NOTICE:   id: 487596 count#: 2238,   time: 0.968759s
NOTICE:   id: 548334 count#: 2180,   time: 1.062516s
NOTICE:   id: 404978 count#: 2179,   time: 29.750295s
NOTICE:   id: 370904 count#: 2123,   time: 22.203384s
NOTICE:   id: 228857 count#: 2223,   time: 29.094126s
NOTICE:   id: 327134 count#: 2169,   time: 24.750242s
NOTICE:   id: 372101 count#: 2180,   time: 28.062825s
NOTICE:   id: 341814 count#: 2130,   time: 30.250353s
NOTICE:   id: 248316 count#: 2195,   time: 32.375377s

But if I repeat the queries on the same ids, the time drops to the ideal millisecond range:

psql -c " ...
select count(*) from table_id4 where id=321158;
select count(*) from table_id4 where id=487596;
select count(*) from table_id4 where id=548334;
select count(*) from table_id4 where id=404978;
select count(*) from table_id4 where id=370904;
select count(*) from table_id4 where id=228857;
"
Time: 5267.168 ms (00:05.267)
Time: 171.925 ms
Time: 24.942 ms
Time: 11.387 ms
Time: 6.753 ms
Time: 17.573 ms
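
If this is just the cold cache, prewarming could confirm it. A sketch, assuming the pg_prewarm extension is available on this Azure tier (I have not verified that); it prewarms the partition index, since a whole partition may exceed shared_buffers:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('table_id4_r2_h5_id_idx');  -- returns the number of blocks read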

Other tables behave similarly. Here are the details of one table:

postgres=> \d+ table_id4
                          Unlogged table "table_id4"
   Column    |   Type   | Collation | Nullable | Default | Storage | Stats target | Description
-------------+----------+-----------+----------+---------+---------+--------------+-------------
 date        | date     |           | not null |         | plain   |              |
 field1      | real     |           |          |         | plain   |              |
 field2      | real     |           |          |         | plain   |              |
 field3      | smallint |           |          |         | plain   |              |
 id          | integer  |           | not null |         | plain   |              |
Partition key: RANGE (id)
Indexes:
    "table_id4_date_idx" btree (date)
    "table_id4_id_idx" btree (id)
Partitions: table_id4_r1 FOR VALUES FROM (0) TO (1...5), PARTITIONED,
            table_id4_r10 FOR VALUES FROM (...93) TO (MAXVALUE), PARTITIONED,
            table_id4_r2 FOR VALUES FROM (1...) TO (3...), PARTITIONED,
            table_id4_r3 FOR VALUES FROM (3...) TO (4...), PARTITIONED,
            table_id4_r4 FOR VALUES FROM (4...) TO (6...), PARTITIONED,
            table_id4_r5 FOR VALUES FROM (6...) TO (7...), PARTITIONED,
            table_id4_r6 FOR VALUES FROM (7...) TO (9...), PARTITIONED,
            table_id4_r7 FOR VALUES FROM (9...) TO (1...1), PARTITIONED,
            table_id4_r8 FOR VALUES FROM (1...1) TO (...32), PARTITIONED,
            table_id4_r9 FOR VALUES FROM (1...2) TO (...93), PARTITIONED

I have seen some similar discussions here and here, but I am not very familiar with databases, and those tables are pretty large, so I would like to understand the problem better before running REINDEX or VACUUM etc., which may take days to complete.

The server resource monitor on the Azure portal shows MAX usage (MAX within a given time window, not averaged over time) of [storage, memory, cpu]: 55%, 30%, <5%. So it seems resources are not the issue?
Some server parameters:

CPU: vCore 2
total memory: 4GB
storage: 1 TB
shared_buffers: 512MB
work_mem: 4MB  (changed to 256MB, but that did not help)
max_parallel_workers: 10
max_parallel_maintenance_workers: 8

Would setting the tables to LOGGED and running ENABLE TRIGGER ALL be helpful?

[UPDATE]: PostgreSQL version 11.

2 Answers


  1. Chosen as BEST ANSWER

    I have an update, and I think it is better to post it separately.

    A few tables have now finished the VACUUM FREEZE ANALYZE. My further tests show the slowness differs between select * from table_name where id=xxxx and select count(*) from table_name where id=xxxx; the latter is very fast. (Before the VACUUM, both queries were slow.)
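
    The command, run per table (in this version, VACUUM on a partitioned table also processes its partitions):

    VACUUM (FREEZE, ANALYZE) table_name;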

    I ran EXPLAIN (ANALYZE, BUFFERS) on many random-id queries of both forms and list the results below.

    psql -c "EXPLAIN (ANALYZE, BUFFERS) select * from table_name where id=xxxx ..." > query_all.log
    psql -c "EXPLAIN (ANALYZE, BUFFERS) select count(*) from table_name where id=xxxx ..." > query_count.log
    

    The ids used are randomly generated and differ from those in the two lines above. The output files "query_all.log" and "query_count.log" contain many blocks like the one below:

    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Unique  (cost=611855.06..611863.92 rows=1670 width=20) (actual time=13496.814..13497.088 rows=1608 loops=1)
       Buffers: shared hit=59 read=1863
       I/O Timings: read=13452.313
       ->  Sort  (cost=611855.06..611859.49 rows=1772 width=20) (actual time=13496.812..13496.891 rows=1816 loops=1)
             Sort Key: table_id8_r1_h1.date
             Sort Method: quicksort  Memory: 134kB
             Buffers: shared hit=59 read=1863
             I/O Timings: read=13452.313
             ->  Nested Loop  (cost=279.44..611759.45 rows=1772 width=20) (actual time=503.656..13492.903 rows=1816 loops=1)
                   Buffers: shared hit=56 read=1863
                   I/O Timings: read=13452.313
                   ->  Limit  (cost=245.47..245.47 rows=1 width=12) (actual time=259.470..259.472 rows=1 loops=1)
                         Buffers: shared hit=56 read=68
                         I/O Timings: read=257.437
                         ->  Sort  (cost=245.47..245.48 rows=4 width=12) (actual time=259.468..259.469 rows=1 loops=1)
                               Sort Key: ((abs((pts_id8.field1 - '44.5424623275445'::double precision)) + abs((pts_id8.field2 - '-82.4602969443103'::double precision))))
                               Sort Method: top-N heapsort  Memory: 25kB
                               Buffers: shared hit=56 read=68
                               I/O Timings: read=257.437
                               ->  Index Scan using pts_id8_field1_idx on pts_id8  (cost=0.43..245.45 rows=4 width=12) (actual time=54.573..259.424 rows=9 loops=1)
                                     Index Cond: ((field1 < '44.5474623275446'::double precision) AND (field1 > '44.5374623275446'::double precision))
                                     Filter: ((field2 < '-82.4552969443103'::double precision) AND (field2 > '-82.4652969443103'::double precision))
                                     Rows Removed by Filter: 5232
                                     Buffers: shared hit=53 read=68
                                     I/O Timings: read=257.437
                   ->  Append  (cost=33.97..609719.47 rows=178564 width=16) (actual time=244.151..13229.906 rows=1816 loops=1)
                         Buffers: shared read=1795
                         I/O Timings: read=13194.876
    ... scan each partition ...
                         ->  Bitmap Heap Scan on table_id8_r6_h4  (cost=34.35..6304.84 rows=1795 width=16) (actual time=244.129..13228.557 rows=1816 loops=1)
                               Recheck Cond: (id = pts_id8.id)
                               Heap Blocks: exact=1787
                               Buffers: shared read=1795
                               I/O Timings: read=13194.876
                               ->  Bitmap Index Scan on table_id8_r6_h4_id_idx  (cost=0.00..33.90 rows=1795 width=0) (actual time=215.713..215.713 rows=1816 loops=1)
                                     Index Cond: (id = pts_id8.id)
                                     Buffers: shared read=8
                                     I/O Timings: read=214.867
    ...
     Planning Time: 4574.608 ms
     Execution Time: 13500.650 ms
    (435 rows)
    

    Since the log files are very long and the focus here is on I/O, I only list the Buffers: shared lines from those ANALYZE logs:

    $ grep -E 'Buffers: shared.*read|Execution Time: ' query_all.log
       Buffers: shared hit=62 read=1784
             Buffers: shared hit=62 read=1784
                   Buffers: shared hit=59 read=1784
                         Buffers: shared hit=8 read=1784
                               Buffers: shared hit=8 read=1784
     Execution Time: 9174.216 ms
       Buffers: shared hit=135 read=1786
             Buffers: shared hit=135 read=1786
                   Buffers: shared hit=135 read=1786
                         Buffers: shared hit=8 read=1786
                               Buffers: shared hit=8 read=1786
     Execution Time: 9075.526 ms
       Buffers: shared hit=58 read=1764
             Buffers: shared hit=58 read=1764
                   Buffers: shared hit=58 read=1764
                         Buffers: shared hit=8 read=1764
                               Buffers: shared hit=8 read=1764
     Execution Time: 48.208 ms
       Buffers: shared hit=12 read=1821
             Buffers: shared hit=12 read=1821
                   Buffers: shared hit=12 read=1821
                         Buffers: shared hit=12 read=38
                               Buffers: shared hit=12 read=38
                                     Buffers: shared hit=12 read=38
                                           Buffers: shared hit=12 read=35
                                                 Buffers: shared hit=9 read=12
                                                 Buffers: shared hit=3 read=23
                         Buffers: shared read=1783
                               Buffers: shared read=1783
                                     Buffers: shared read=8
     Execution Time: 10528.702 ms
       Buffers: shared hit=5 read=1796
             Buffers: shared hit=5 read=1796
                   Buffers: shared hit=5 read=1796
                         Buffers: shared hit=4 read=44
                               Buffers: shared hit=4 read=44
                                     Buffers: shared hit=4 read=44
                                           Buffers: shared hit=4 read=41
                                                 Buffers: shared hit=2 read=18
                                                 Buffers: shared hit=2 read=23
                         Buffers: shared hit=1 read=1752
                               Buffers: shared hit=1 read=1752
                                     Buffers: shared hit=1 read=7
     Execution Time: 9520.122 ms
       Buffers: shared hit=32 read=1771
             Buffers: shared hit=32 read=1771
                   Buffers: shared hit=32 read=1771
                         Buffers: shared hit=32 read=18
                               Buffers: shared hit=32 read=18
                                     Buffers: shared hit=32 read=18
                                           Buffers: shared hit=30 read=17
                                                 Buffers: shared hit=3 read=17
                         Buffers: shared read=1753
                               Buffers: shared read=1753
                                     Buffers: shared read=8
     Execution Time: 8782.118 ms
       Buffers: shared hit=101 read=1811
             Buffers: shared hit=101 read=1811
                   Buffers: shared hit=101 read=1811
                         Buffers: shared hit=100 read=26
                               Buffers: shared hit=100 read=26
                                     Buffers: shared hit=100 read=26
                         Buffers: shared hit=1 read=1785
                               Buffers: shared hit=1 read=1785
                                     Buffers: shared hit=1 read=6
     Execution Time: 191.813 ms
       Buffers: shared hit=15 read=1802
             Buffers: shared hit=15 read=1802
                   Buffers: shared hit=15 read=1802
                         Buffers: shared hit=14 read=33
                               Buffers: shared hit=14 read=33
                                     Buffers: shared hit=14 read=33
                                           Buffers: shared hit=14 read=30
                                                 Buffers: shared hit=2 read=16
                                                 Buffers: shared hit=12 read=14
                         Buffers: shared hit=1 read=1769
                               Buffers: shared hit=1 read=1769
                                     Buffers: shared hit=1 read=7
       Buffers: shared hit=27 read=1750
             Buffers: shared hit=27 read=1750
                   Buffers: shared hit=27 read=1750
                         Buffers: shared hit=27 read=21
                               Buffers: shared hit=27 read=21
                                     Buffers: shared hit=27 read=21
                                           Buffers: shared hit=27 read=18
                                                 Buffers: shared hit=8 read=11
                                                 Buffers: shared hit=19 read=7
                         Buffers: shared read=1729
                               Buffers: shared read=1729
                                     Buffers: shared read=8
     Execution Time: 274.640 ms
       Buffers: shared hit=44 read=1861
             Buffers: shared hit=44 read=1861
                   Buffers: shared hit=44 read=1861
                         Buffers: shared hit=43 read=79
                               Buffers: shared hit=43 read=79
                                     Buffers: shared hit=43 read=79
                         Buffers: shared hit=1 read=1782
                               Buffers: shared hit=1 read=1782
                                     Buffers: shared hit=1 read=7
     Execution Time: 705.733 ms
       Buffers: shared hit=31 read=1816
             Buffers: shared hit=31 read=1816
                   Buffers: shared hit=31 read=1816
                         Buffers: shared hit=30 read=19
                               Buffers: shared hit=30 read=19
                                     Buffers: shared hit=30 read=19
                                           Buffers: shared hit=27 read=19
                                                 Buffers: shared hit=2 read=19
                         Buffers: shared hit=1 read=1797
                               Buffers: shared hit=1 read=1797
                                     Buffers: shared hit=1 read=6
     Execution Time: 11873.418 ms
    ...
    

    The count(*) log shows much shorter execution times:

    $ grep -E 'Buffers: shared.*read|Execution Time: ' query_count.log
       Buffers: shared hit=28 read=35
             Buffers: shared hit=28 read=35
                   Buffers: shared hit=24 read=28
                         Buffers: shared hit=24 read=28
                               Buffers: shared hit=21 read=28
                                     Buffers: shared hit=21 read=25
                                           Buffers: shared hit=7 read=13
                                           Buffers: shared hit=14 read=12
                   Buffers: shared hit=4 read=7
                         Buffers: shared hit=4 read=7
     Execution Time: 235.408 ms
       Buffers: shared hit=47 read=87
             Buffers: shared hit=47 read=87
                   Buffers: shared hit=46 read=77
                         Buffers: shared hit=46 read=77
                               Buffers: shared hit=46 read=77
                   Buffers: shared hit=1 read=10
                         Buffers: shared hit=1 read=10
     Execution Time: 625.582 ms
       Buffers: shared hit=44 read=89
             Buffers: shared hit=44 read=89
                   Buffers: shared hit=44 read=79
                         Buffers: shared hit=44 read=79
                               Buffers: shared hit=44 read=79
                   Buffers: shared read=10
                         Buffers: shared read=10
     Execution Time: 490.060 ms
       Buffers: shared hit=105 read=31
             Buffers: shared hit=105 read=31
                   Buffers: shared hit=100 read=24
                         Buffers: shared hit=100 read=24
                               Buffers: shared hit=100 read=24
                   Buffers: shared hit=5 read=7
                         Buffers: shared hit=5 read=7
     Execution Time: 275.391 ms
     Execution Time: 220.718 ms
       Buffers: shared hit=62 read=79
             Buffers: shared hit=62 read=79
                   Buffers: shared hit=62 read=67
                         Buffers: shared hit=62 read=67
                               Buffers: shared hit=62 read=67
                   Buffers: shared read=12
                         Buffers: shared read=12
     Execution Time: 633.331 ms
     Execution Time: 105.161 ms
       Buffers: shared hit=114 read=24
             Buffers: shared hit=114 read=24
                   Buffers: shared hit=109 read=17
                         Buffers: shared hit=109 read=17
                               Buffers: shared hit=109 read=17
                   Buffers: shared hit=5 read=7
                         Buffers: shared hit=5 read=7
     Execution Time: 156.873 ms
    ...
    

    I ran ~50 random batch tests. The select * usually takes either <1 sec or >~10 sec, never in between 1 and 10 sec. The select count(*) usually takes ~0.5 sec, and always <1 sec.

    If the slow I/O were caused by Azure, I would expect to see an "even" spread in execution time, say ranging continuously from 0 to 20 sec. But here I only see pretty fast (<1 sec) and very slow (>10 sec), nothing in between. Could this still be an issue with my query plan not being optimized?

    As for the difference between select * and select count(*): is it caused by the latter transferring less data (e.g., the Buffers: shared read=... numbers)? Or, as this post suggests, does count(*) skip the heap and count entries in the index?


  2. so I would like to understand the problem better before running REINDEX or VACUUM etc., which may take days to complete

    You are going to need a vacuum to happen at some point anyway, so I don’t see why you should delay doing one just as a learning exercise. A reindex, on the other hand, might never need to happen, and will lock your table while it is going on, so I wouldn’t preemptively do one of those.

    I guess you could wonder why autovacuum isn’t already taking care of it, and maybe a manual vacuum would destroy evidence around that question? Also, a manual vacuum is (by default) unthrottled, so doing one could cause performance problems for other queries by using up the available resources.
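
    If that is a concern, a manual run can be throttled; a sketch (for a manual VACUUM, vacuum_cost_delay defaults to 0, i.e. unthrottled; 20 ms here is just an illustrative value):

    SET vacuum_cost_delay = 20;
    VACUUM (ANALYZE) table_id4;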

    For this case, I would expect it to be doing an index-only scan, and the only reason I can think of why it would not is that your pg_class.relallvisible for the table is very low relative to relpages (which in turn probably indicates that vacuum is not doing what it should). There are two ways to assess this other than running a preemptive vacuum. You could directly query pg_class and see what it says for relallvisible and relpages for the relevant tables, or you could force an index-only scan and see what it reports for heap fetches in the explain (analyze, buffers). For the latter, doing set enable_bitmapscan=off will probably suffice to get it to do the index-only scan.
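
    A sketch of the pg_class check (adjust the relname pattern to your partition names):

    SELECT relname, relpages, relallvisible,
           round(relallvisible::numeric / nullif(relpages, 0), 3) AS frac_visible
    FROM pg_class
    WHERE relname LIKE 'table_id4%'
    ORDER BY relname;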

    I see you added the bitmap-off plan while I was typing, and indeed the heap fetches count is as high as the number of rows returned, so the visibility map is not built up, and a vacuum is what you need to make that happen. Since you are on an old version (autovacuum_vacuum_insert_scale_factor was introduced in v13) and these are fresh tables (inserted/copied but never updated or deleted), a manual vacuum is the only way to get them vacuumed.

    The server resource monitor on the Azure portal shows MAX usage of [storage, memory, cpu]: 55%, 30%, <5%

    I’m not familiar with this Azure portal. Is "storage" referring to storage capacity (before it runs out of space), or storage performance? If performance, does it refer to byte-throughput performance (where 100% can only be reached by sequential reads) or IOPS performance? If the latter, then maybe you would benefit from increasing effective_io_concurrency.
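
    For example, to test at the session level (a sketch; on your version this setting mainly affects prefetching during bitmap heap scans, and 200 is just an illustrative value for SSD-backed storage):

    SET effective_io_concurrency = 200;
    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM table_id4 WHERE id = 244730;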
