
I have a large partitioned table on Azure Flexible Server for PostgreSQL (version 13), and I need to run count(*) on that table and compare the performance with Oracle. However, I have some concerns after checking the explain plan, as below.

  1. The explain plan already uses an Index Only Scan to get the data, but it always gets a wrong estimated row count (20954927 estimated vs. 16763942 actual).

https://explain.depesz.com/s/CmTi

-> Parallel Index Only Scan using stat_201408_acct_id_idx on stat_201408 (cost=0.44..579554.83 rows=20954927 width=0) (actual time=0.017..1218.683 rows=16763942 loops=3)

SELECT reltuples::bigint, relpages, relallvisible FROM pg_class WHERE relname = 'stat_201408';

 reltuples | relpages | relallvisible
-----------+----------+---------------
  50291824 |   470018 |        470018

select count(*) from stat_201408;

  count
----------
 50291825
(1 row)

I even tried increasing the statistics target on column acct_id to 3000, but it doesn't help:

ALTER TABLE stat_201408 ALTER COLUMN acct_id SET STATISTICS 3000;

VACUUM (ANALYZE, VERBOSE) stat_201408;

I was only testing on one partition, so the difference will be much bigger on the whole partitioned table.

  2. On this table I have the indexes below, but it is not clear to me why the planner always uses stat_idx1 (on acct_id) rather than stat_efftv_dt_acct_id_idx.

Indexes:

"stat_batch_id_idx" btree (batch_id)
"stat_efftv_dt_acct_id_idx" btree (efftv_dt, acct_id)
"stat_idx1" btree (acct_id)
"stat_idx7" btree (batch_id, stat_cd)
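
One way to compare the two candidate indexes is to check their on-disk sizes, since a smaller index is cheaper to scan end-to-end (a sketch using the standard catalog size functions):

```sql
-- Compare the on-disk sizes of the two indexes the planner can use
-- for an unqualified count(*); the smaller one reads fewer pages.
SELECT relname,
       pg_relation_size(oid) AS bytes,
       pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('stat_idx1', 'stat_efftv_dt_acct_id_idx');
```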

Below are number of distinct values on some columns

SELECT count(DISTINCT acct_id), count(DISTINCT efftv_dt),count(DISTINCT batch_id) FROM wmcb_int.acct_stat_201408;

 acct_id | efftv_dt | batch_id
---------+----------+----------
 1271151 |       21 |       21

Statistics on some columns of this partition

  tablename   |   attname    | total_rows | correlation
--------------+--------------+------------+-------------
 stat_201408  | batch_id     |   50291824 |   0.9995212
 stat_201408  | efftv_dt     |   50291824 |   0.9995212
 stat_201408  | acct_stat_cd |   50291824 | 0.051139828
 stat_201408  | acct_id      |   50291824 | 0.044063628
(4 rows)

My questions are:

  1. How can I fix the wrong estimated rows in the explain plan?
  2. How can I force PostgreSQL to use a particular index with a hint, as in Oracle?
  3. Does the number of distinct values in a column decide which index is chosen? How can I find out which columns should be indexed to get the best performance for count(*)?

Thanks so much

2 Answers


  1. Chosen as BEST ANSWER

    I tried to force the use of stat_efftv_dt_acct_id_idx (columns efftv_dt, acct_id) as shown below, and it actually got a better actual execution time. However, everything is rolled back after the query completes, so the next run reads from disk I/O instead of benefiting from the shared buffer hits of the previous run. My concern is that I don't want to drop "stat_idx1" (acct_id) permanently, because it is needed by some queries with a predicate on acct_id.

    begin;

    drop index stat_idx1;

    EXPLAIN (analyze, verbose, costs, timing, buffers) select count(*) from stat_201408;

    rollback;


  2. When a parallel query is run, it is assumed not all workers will process exactly the same number of rows, because rows are not divvied up one by one but rather by blocks or ranges of blocks; and in general not all blocks have the same number of active rows. The planner accounts for this by inflating the estimated number of rows, on the assumption that the total query time will be driven by the slowest worker and the slowest worker will be the one with the most rows.
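
    The inflated number can be reproduced from the planner's arithmetic (a sketch, assuming the logic of get_parallel_divisor in PostgreSQL's costsize.c: with a participating leader, the per-worker divisor is workers + (1 - 0.3 * workers)):

    ```sql
    -- With 2 workers plus the leader (loops=3 in the plan), the divisor
    -- is 2 + (1 - 0.3 * 2) = 2.4, so the per-worker row estimate is:
    SELECT round(50291824 / 2.4);   -- 20954927, the estimate in the plan
    -- EXPLAIN ANALYZE reports "actual rows" as the average per loop:
    SELECT round(50291825 / 3.0);   -- 16763942, matching rows=16763942 loops=3
    ```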

    This is operating as designed. There is no way to (nor reason to, as far as I can tell) "fix" it apart from hacking up the code and recompiling.

    There is a third-party extension, pg_hint_plan, that you could use to add Oracle-style hints to PostgreSQL, although it is hard to see what you hope to accomplish by doing so in this case.
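
    For illustration, forcing the other index with pg_hint_plan might look like the following (a sketch; it assumes the extension is installed and loaded, and the hint comment must precede the query text):

    ```sql
    /*+ IndexOnlyScan(stat_201408 stat_efftv_dt_acct_id_idx) */
    SELECT count(*) FROM stat_201408;
    ```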

    If you don't want to install pg_hint_plan, you could test alternative indexes just by dropping the one you don't want used and seeing what the planner does instead. If you don't want to go through the work of rebuilding the dropped index, you could drop it in a transaction, run the EXPLAIN, then roll back the drop.

    It is hard to see what you hope to accomplish by all of this. Transitioning from Oracle to PostgreSQL will raise enough real trouble of its own, there is no reason to go about dreaming up imaginary ones.
