I have two columns: column1 with two distinct values (0, 1) and column2 with three distinct values ('A', 'B', 'C'). The value in column1 is always 0 when column2 is 'A' or 'B', but when column2 is 'C', column1 has the following distribution: (0: 15%, 1: 85%). There is also a third column, departmentid. In some departments the distribution of column1 for column2 = 'C' is (0: 0%, 1: 100%). So in some cases, when I have a query
SELECT * FROM mytable WHERE departmentid = 42 AND column2 = 'C' AND column1 = 0 ORDER BY id LIMIT 10;
PostgreSQL chooses an index scan on id and assumes there will be ~25000 matching rows, but the table contains almost no rows matching this filter. So the query scans the whole table via that index scan, and it takes too long. If the database chose a bitmap scan instead, it would be about 50x faster (based on other queries on this table). Indexes on all of these columns are present.
I have two questions:
- I created a statistics object on column1, column2, and departmentid, but the two dependencies I am interested in are missing: from column2 to column1, and from (column2, departmentid) to column1. Why? In the statistics object on column1 and column2, dependencies is null. Of course, I ran ANALYZE.
- How can I speed up this query? Is there any way to do it without creating a specific 4-column index (departmentid, column2, column1, id)? The real production query has many other filters and different sort orders (the problem reproduces with this minimal query).
PostgreSQL 16.1, autovacuum and autoanalyze run every day, ~3 million rows in the table.
update:
table definition (it has more than 100 columns and more than 50 indexes; only the related ones are shown):
CREATE TABLE mytable
(
    id           SERIAL PRIMARY KEY,
    column1      INTEGER DEFAULT 0 NOT NULL,
    column2      TEXT NOT NULL,
    departmentid INTEGER
);
CREATE INDEX mytable_departmentid_index
ON mytable (departmentid);
CREATE INDEX mytable_column1_index
ON mytable (column1);
CREATE INDEX mytable_column2_index
ON mytable (column2);
-- tested statistic
CREATE STATISTICS mytable_column2_column1 ON column2, column1 FROM mytable;
CREATE STATISTICS mytable_departmentid_column1_column2 ON departmentid, column1, column2 FROM mytable;
departmentid has < 0.0002% nulls; the statistics say there are no nulls. The index on id exists because id is the primary key. All indexes are btree.
query plan (other columns are removed, numbers are actual):
Limit (cost=0.68..501.79 rows=10 width=2690) (actual time=35351.049..47175.738 rows=1 loops=1)
Output: id, column1, column2, departmentid
Buffers: shared hit=1682274 read=1646793 dirtied=1640 written=980
I/O Timings: shared/local read=39193.882 write=9.565
WAL: records=1637 fpi=1637 bytes=3034199
-> Index Scan using mytable_pkey on public.mytable (cost=0.68..1392081.41 rows=27780 width=2690) (actual time=35351.048..47175.735 rows=1 loops=1)
Output: id, column1, column2, departmentid
Filter: ((mytable.departmentid = 42) AND (mytable.column2 = 'C'::text) AND (mytable.column1 = 0))
Rows Removed by Filter: 3536431
Buffers: shared hit=1682274 read=1646793 dirtied=1640 written=980
I/O Timings: shared/local read=39193.882 write=9.565
WAL: records=1637 fpi=1637 bytes=3034199
Settings: effective_cache_size = '8008368kB', effective_io_concurrency = '0', geqo_effort = '10', jit = 'off', random_page_cost = '1.2', search_path = 'public'
Planning Time: 0.503 ms
Execution Time: 47175.783 ms
update 2:
Question 1 was my mistake: there is a (column2, departmentid) => column1 dependency, just no column2 => column1 one. I was also wrong that column1 is always 0 when column2 is in ('A', 'B'). The correct information is present in the statistics, but the percentage is small.
Why the 4-column index is bad: the end user can choose which column to order by, so the index would only cover the default case. Filtering by column1 and column2 is also just the default. So I would prefer a more general solution.
Question 3: Is there some way to fix this by changing the DDL?
Question 4: Is there a way to calculate the dependency between values with a query, like the corr() function?
Information from pg_stats_ext (all attribute numbers replaced with names):
statistics_name | attnames | kinds | inherited | n_distinct | dependencies | most_common_vals | most_common_val_nulls | most_common_freqs | most_common_base_freqs |
---|---|---|---|---|---|---|---|---|---|
mytable_column2_column1 | {column1,column2} | {d,f,m} | false | {"column1, column2": 6} | <null> | {{1,C},{0,C},{0,B},{1,B},{1,A},{0,A}} | {{false,false},{false,false},{false,false},{false,false},{false,false},{false,false}} | {0.80046,0.13586,0.02913,0.02236,0.00660,0.00556} | {0.77662,0.15970,0.00878,0.04271,0.01009,0.00207} |
mytable_departmentid_column1_column2 | {column1,departmentid,column2} | {d,f,m} | false | {"column1, departmentid": 1433, "column1, column2": 6, "departmentid, column2": 1492, "column1, departmentid, column2": 1784} | {"departmentid => column1": 0.137700, "departmentid => column2": 0.115367, "column1, departmentid => column2": 0.228133, "departmentid, column2 => column1": 0.226600} | {{1,44,C},{0,43,C},{1,42,C},{1,43,C}… | {{false,false,false},{false,false,false},{false,false,false},{false,false,false}… | {0.39790,0.06323,0.04683,0.02073… | {0.32481,0.01344,0.03789,0.06539… |
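For reference, the pg_stats_ext system view already resolves attribute numbers to names, so output like the above can be pulled with a query along these lines:

SELECT statistics_name, attnames, kinds, inherited, n_distinct,
       dependencies, most_common_vals, most_common_val_nulls,
       most_common_freqs, most_common_base_freqs
FROM pg_stats_ext
WHERE tablename = 'mytable';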
Information from pg_stats:
attname | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | correlation | histogram_bounds |
---|---|---|---|---|---|---|---|---|
column1 | false | 0 | 4 | 2 | {1,0} | {0.82943,0.17056} | 0.64609 | <null> |
departmentid | false | 0 | 4 | 1080 | {44,43,42,45… | {0.41823,0.08420,0.04879,0.017466… | 0.72736 | {57,72,147,200… |
column2 | false | 0 | 16 | 3 | {C,B,A} | {0.93633,0.05150,0.01216} | 0.32751 | <null> |
2 Answers
I don’t think there is an elegant way to fix this.
A brute force way to fix this query is to specify it as ORDER BY id+0. That will force the planner not to use the primary key index to provide ordering, so it has to resort to something like the bitmap scan you alluded to. Of course, there might be similar situations where using the primary key to provide ordering would actually be a good idea, and writing all queries this way would prevent that from happening even when you would prefer it.

Another option would be to just have the index (departmentid, id). This would not be as good as the 4-column index, but it should still be much better than the existing query plan. I don't understand what your objection to the 4-column index is, so I don't know whether that objection would apply to this two-column index as well.
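For concreteness, a sketch of both workarounds (the index name is illustrative):

-- Workaround 1: id+0 is an expression, so it no longer matches the
-- primary-key index's sort order and the planner cannot use an index
-- scan on id to satisfy the ORDER BY.
SELECT *
FROM mytable
WHERE departmentid = 42 AND column2 = 'C' AND column1 = 0
ORDER BY id + 0
LIMIT 10;

-- Workaround 2: an index that returns the rows of a single department
-- already ordered by id.
CREATE INDEX mytable_departmentid_id_index
    ON mytable (departmentid, id);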
I think the dependency stats are only used to increase the row count estimate, not to decrease it. For example, with state = 'Illinois' AND city = 'Chicago', the planner uses them to know that it can't just multiply the two selectivities together; informally, it takes whichever of the two is more selective. But you have the opposite issue: yours is more like state = 'Illinois' AND city = 'Los Angeles'. Instead of containing largely redundant information, the columns contain largely contradictory information, and the PostgreSQL code doesn't use dependency information for that. It could use the MCV estimates instead, but for that to work the sum of most_common_freqs needs to be close to one. In other words, to know that something absent from the MCV list is also absent from the table, the MCV list must be comprehensive. You should be able to achieve this by increasing the stats target. At least, that works for me: just increasing default_statistics_target to 10000 and then analyzing the table gets the row estimates right in my loose simulation, based on the 3-column MCV.
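A sketch of that fix; if you don't want to raise the target globally, ALTER STATISTICS ... SET STATISTICS (available since PostgreSQL 13) raises it for just the one extended statistics object:

-- Raise the target so the extended MCV lists become (nearly)
-- comprehensive, then rebuild the statistics.
SET default_statistics_target = 10000;  -- session-level; affects the next ANALYZE
ANALYZE mytable;

-- Alternatively, target only the extended statistics object:
ALTER STATISTICS mytable_departmentid_column1_column2 SET STATISTICS 10000;
ANALYZE mytable;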
Another solution might be to partition the table. If you partition on departmentid, then just the two-column MCV statistics on (column1, column2) could work, at a much lower statistics target. But custom statistics don't work the way I expected on partitioned tables; you need to build the statistics on each partition separately.
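A minimal sketch of that approach, assuming list partitioning on departmentid (all names are illustrative, and a real migration would also need the other 100+ columns and a primary key that includes the partition key):

-- Hypothetical partitioned layout.
CREATE TABLE mytable_part (
    id           SERIAL,
    column1      INTEGER DEFAULT 0 NOT NULL,
    column2      TEXT NOT NULL,
    departmentid INTEGER NOT NULL
) PARTITION BY LIST (departmentid);

CREATE TABLE mytable_part_42 PARTITION OF mytable_part
    FOR VALUES IN (42);

-- Per the note above, build the extended statistics on each
-- partition rather than on the partitioned parent.
CREATE STATISTICS mytable_part_42_c2_c1 ON column2, column1
    FROM mytable_part_42;
ANALYZE mytable_part_42;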