
I have two columns: column1 with two distinct values (0, 1) and column2 with three distinct values ('A', 'B', 'C'). The value in column1 is always 0 when column2 is 'A' or 'B', but when column2 is 'C', column1 has the following distribution: (0: 15%, 1: 85%). There is also a third column, departmentid. In some departments the distribution of column1 for column2 = 'C' is (0: 0%, 1: 100%). So in some cases I have a query like

SELECT * FROM mytable WHERE departmentid = 42 AND column2 = 'C' AND column1 = 0 ORDER BY id LIMIT 10;

PostgreSQL chooses an index scan on id and estimates ~25000 matching rows, but the table contains no rows matching this filter. So the query scans the whole table via the index scan, which takes far too long. If the planner chose a bitmap scan instead, it would be about 50x faster (based on other queries on this table). Indexes on all of these columns exist.
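For reference, one way the bitmap-scan comparison can be reproduced in a session (an illustrative sketch; the 50x figure in the question came from other queries on this table):

-- session-local experiment: turn off plain index scans so the planner
-- falls back to a bitmap scan, then compare plans and timings
BEGIN;
SET LOCAL enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM mytable
WHERE departmentid = 42 AND column2 = 'C' AND column1 = 0
ORDER BY id
LIMIT 10;
ROLLBACK;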

I have two questions:

  1. I created a statistics object on column1, column2 and departmentid, but the two dependencies I am interested in are missing: column2 => column1 and (column2, departmentid) => column1. Why? In the statistics object on column1 and column2, the dependencies field is null. Of course I ran ANALYZE.
  2. How can I speed up this query? Is there a way to do it without creating a specific 4-column index (departmentid, column2, column1, id)? The real production query has many other filters and different sort orders (the problem reproduces with this minimal query).

PostgreSQL 16.1, autovacuum and autoanalyze run every day, ~3 million rows in the table.

Update:
Table definition (the table has more than 100 columns and more than 50 indexes; only the relevant parts are shown):

CREATE TABLE mytable
(
  id           SERIAL
    PRIMARY KEY,
  column1      INTEGER DEFAULT 0 NOT NULL,
  column2      TEXT              NOT NULL,
  departmentid INTEGER
);
CREATE INDEX mytable_departmentid_index
    ON mytable (departmentid);
CREATE INDEX mytable_column1_index
    ON mytable (column1);
CREATE INDEX mytable_column2_index
    ON mytable (column2);

-- tested statistics objects
CREATE STATISTICS mytable_column2_column1 ON column2, column1 FROM mytable;
CREATE STATISTICS mytable_departmentid_column1_column2 ON departmentid, column1, column2 FROM mytable;

departmentid has < 0.0002% NULLs; the statistics say there are no NULLs. The index on id exists because id is the primary key. All indexes are B-tree.

Query plan (other columns removed; row counts and timings are actual):

Limit  (cost=0.68..501.79 rows=10 width=2690) (actual time=35351.049..47175.738 rows=1 loops=1)
  Output: id, column1, column2, departmentid
  Buffers: shared hit=1682274 read=1646793 dirtied=1640 written=980
  I/O Timings: shared/local read=39193.882 write=9.565
  WAL: records=1637 fpi=1637 bytes=3034199
  ->  Index Scan using mytable_pkey on public.mytable  (cost=0.68..1392081.41 rows=27780 width=2690) (actual time=35351.048..47175.735 rows=1 loops=1)
        Output: id, column1, column2, departmentid
        Filter: ((mytable.departmentid = 42) AND (mytable.column2 = 'C'::text) AND (mytable.column1 = 0))
        Rows Removed by Filter: 3536431
        Buffers: shared hit=1682274 read=1646793 dirtied=1640 written=980
        I/O Timings: shared/local read=39193.882 write=9.565
        WAL: records=1637 fpi=1637 bytes=3034199
Settings: effective_cache_size = '8008368kB', effective_io_concurrency = '0', geqo_effort = '10', jit = 'off', random_page_cost = '1.2', search_path = 'public'
Planning Time: 0.503 ms
Execution Time: 47175.783 ms

Update 2:
Question 1 was my mistake: there is a (departmentid, column2) => column1 dependency, but no column2 => column1 one. I was also wrong that column1 is always 0 when column2 is 'A' or 'B'. The correct information is present in the statistics, but it is not a large percentage.

Why the 4-column index is bad: the end user can choose which column to order by, so that index would only cover the default case. Filtering by column1 and column2 is also just the default. So I would prefer a more general solution.

Question 3: Is there some way to fix this by changing the DDL?

Question 4: Is there a way to calculate the dependency between values with a query, similar to the corr() function?

Information from pg_stats_ext (attribute numbers replaced with column names):

statistics_name        | mytable_column2_column1
attnames               | {column1,column2}
kinds                  | {d,f,m}
inherited              | false
n_distinct             | {"column1, column2": 6}
dependencies           | <null>
most_common_vals       | {{1,C},{0,C},{0,B},{1,B},{1,A},{0,A}}
most_common_val_nulls  | {{false,false},{false,false},{false,false},{false,false},{false,false},{false,false}}
most_common_freqs      | {0.80046,0.13586,0.02913,0.02236,0.00660,0.00556}
most_common_base_freqs | {0.77662,0.15970,0.00878,0.04271,0.01009,0.00207}

statistics_name        | mytable_departmentid_column1_column2
attnames               | {column1,departmentid,column2}
kinds                  | {d,f,m}
inherited              | false
n_distinct             | {"column1, departmentid": 1433, "column1, column2": 6, "departmentid, column2": 1492, "column1, departmentid, column2": 1784}
dependencies           | {"departmentid => column1": 0.137700, "departmentid => column2": 0.115367, "column1, departmentid => column2": 0.228133, "departmentid, column2 => column1": 0.226600}
most_common_vals       | {{1,44,C},{0,43,C},{1,42,C},{1,43,C}…
most_common_val_nulls  | {{false,false,false},{false,false,false},{false,false,false},{false,false,false}…
most_common_freqs      | {0.39790,0.06323,0.04683,0.02073…
most_common_base_freqs | {0.32481,0.01344,0.03789,0.06539…
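
For reference, the figures above can be pulled with a query along these lines (column names are those of the stock pg_stats_ext view in PostgreSQL 16):

-- read extended-statistics contents for the table
SELECT statistics_name, attnames, kinds, inherited, n_distinct, dependencies,
       most_common_vals, most_common_freqs, most_common_base_freqs
FROM pg_stats_ext
WHERE tablename = 'mytable';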

Information from pg_stats:

attname           | column1
inherited         | false
null_frac         | 0
avg_width         | 4
n_distinct        | 2
most_common_vals  | {1,0}
most_common_freqs | {0.82943,0.17056}
correlation       | 0.64609
histogram_bounds  | <null>

attname           | departmentid
inherited         | false
null_frac         | 0
avg_width         | 4
n_distinct        | 1080
most_common_vals  | {44,43,42,45…
most_common_freqs | {0.41823,0.08420,0.04879,0.017466…
correlation       | 0.72736
histogram_bounds  | {57,72,147,200…

attname           | column2
inherited         | false
null_frac         | 0
avg_width         | 16
n_distinct        | 3
most_common_vals  | {C,B,A}
most_common_freqs | {0.93633,0.05150,0.01216}
correlation       | 0.32751
histogram_bounds  | <null>

2 Answers


  1. I don’t think there is an elegant way to fix this.

    A brute force way to fix this query is to write it as ORDER BY id+0. That will force the planner not to use the primary key index to provide the ordering, so it has to resort to something like the bitmap scan you alluded to. Of course there might be similar situations where using the primary key to provide the ordering would actually be a good idea, and writing all queries this way would prevent that from happening even when you would prefer it.
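
    As a sketch, the original query rewritten that way (only the ORDER BY changes):

    SELECT * FROM mytable
    WHERE departmentid = 42 AND column2 = 'C' AND column1 = 0
    ORDER BY id + 0   -- id + 0 cannot be matched to mytable_pkey, so the planner must filter first and sort
    LIMIT 10;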

    Another option would be to just have the index (departmentid, id). This would not be as good as the 4-column index, but it should still be much better than the existing query plan. I don't understand your objection to the 4-column index, so I don't know whether that objection would also apply to this two-column index.
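
    A minimal sketch of that index (the name is just an example):

    -- lets the planner jump to one department's rows already ordered by id
    CREATE INDEX mytable_departmentid_id_index ON mytable (departmentid, id);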

  2. I think the dependency stats are only used to increase the row count estimate, not to decrease it. For example, with state='Illinois' and city='Chicago', they let the planner know it can't just multiply the two selectivities together; informally, it takes whichever of the two is more selective. But you have the opposite issue: it would be like state='Illinois' and city='Los Angeles'. Instead of containing largely redundant information, the conditions contain largely contradictory information, and the PostgreSQL code doesn't use dependency information for that case. It could use the MCV estimates instead, but for that to work the sum of most_common_freqs needs to be close to one. In other words, to know that something absent from the MCV list is also absent from the table, the MCV list must be comprehensive. You should be able to achieve this by increasing the statistics target. At least, that works for me: just increasing default_statistics_target to 10000 and then analyzing the table makes the row estimates correct in my loose simulation, based on the 3-column MCV.
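
    A minimal sketch of that experiment (session-level; 10000 is the value from the text and may be more than you need):

    -- raise the sample size so the MCV lists become (near) comprehensive,
    -- then rebuild the statistics
    SET default_statistics_target = 10000;
    ANALYZE mytable;

    -- per-object alternative (PostgreSQL 13+), targeting just the extended statistics:
    -- ALTER STATISTICS mytable_departmentid_column1_column2 SET STATISTICS 10000;
    -- ANALYZE mytable;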

    Another solution might be to partition the table. If you partition on departmentid, then just the two-column MCV on (column1, column2) could work, at a much lower statistics target. But extended statistics don't work the way I expected on partitioned tables: you need to build the statistics object on each partition separately.
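
    A rough sketch of such a layout, assuming list partitioning on departmentid, with partition names and bounds invented for illustration (the real table has far more columns):

    -- parent table; only the columns from the question are shown
    CREATE TABLE mytable_part (
        id           integer NOT NULL,
        column1      integer DEFAULT 0 NOT NULL,
        column2      text    NOT NULL,
        departmentid integer
    ) PARTITION BY LIST (departmentid);

    -- one partition per department (or per group of departments)
    CREATE TABLE mytable_part_42      PARTITION OF mytable_part FOR VALUES IN (42);
    CREATE TABLE mytable_part_default PARTITION OF mytable_part DEFAULT;

    -- extended statistics have to be created on each partition separately
    CREATE STATISTICS mytable_part_42_stats ON column2, column1 FROM mytable_part_42;
    ANALYZE mytable_part_42;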
