
I have a table with ~35M rows, and I'm trying to find "processed" records to remove from time to time. There are 14 valid statuses, and 10 of them count as processed.

create table tbl (
    id uuid default uuid_generate_v4() not null primary key,
    fk_id uuid not null references fk_table,
    -- ... other columns
    created_date timestamptz default now() not null,
    status varchar(128) not null
);

Values for status can be one of a,b,c,d,e,f,g,h,i,j,k,l,m,n (14)

The index is on (status,created_date).
A query like:

select id from tbl
where created_date < 'somedate'
and status = ANY('{a,b,c,d,e,f,g,h,i,j}')  -- one of the first 10 values

The query planner insists on using a full sequential scan instead of the index.

Is there a trick to make Postgres use the index for the status = ANY part of the predicate?

2 Answers


  1. If more than a few percent of rows qualify – or rather, if Postgres estimates as much – it will choose a sequential scan, which is faster in that case.

    If, in fact, only a few rows qualify, then your column statistics (and/or cost settings) are to blame for the misleading estimates.

    If the cited index only serves the purpose at hand, and only relatively few rows have a "processed" status, replace it with a partial index:

    CREATE INDEX foo ON tbl (created_date) WHERE status = ANY('{a,b,c,d,e,f,g,h,i,j}');
    

    That would make the index much smaller, the query faster, and the chance that it gets used higher.
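
    Note that the planner only considers a partial index when it can prove that the query's WHERE clause implies the index predicate, so the query should repeat that predicate verbatim, e.g.:

    SELECT id
    FROM   tbl
    WHERE  created_date < 'somedate'
    AND    status = ANY('{a,b,c,d,e,f,g,h,i,j}');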

    Either way, increasing the statistics target for created_date and status at least by a bit will most probably help.
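
    A minimal sketch (the target of 1000 is only an example; the default is 100):

    ALTER TABLE tbl ALTER COLUMN status       SET STATISTICS 1000;
    ALTER TABLE tbl ALTER COLUMN created_date SET STATISTICS 1000;
    ANALYZE tbl;  -- rebuild statistics with the new target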

    More aggressive autovacuum settings for the table can also help.
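
    For example, via per-table storage parameters (the values are only a starting point, not a recommendation for your workload):

    ALTER TABLE tbl SET (
        autovacuum_vacuum_scale_factor  = 0.01,  -- vacuum after ~1% of rows are dead
        autovacuum_analyze_scale_factor = 0.01   -- analyze after ~1% of rows have changed
    );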

    Either way, for only 14 valid statuses, status varchar(128) seems tremendously wasteful.
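
    If the set of statuses is stable, an enum would be one cheaper alternative (a sketch only; the type name is arbitrary, and the conversion rewrites the table):

    CREATE TYPE tbl_status AS ENUM ('a','b','c','d','e','f','g','h','i','j','k','l','m','n');
    ALTER TABLE tbl
        ALTER COLUMN status TYPE tbl_status
        USING status::tbl_status;  -- full table rewrite; plan a maintenance window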

    Also, the planner has become smarter about this in Postgres 16 as compared to Postgres 12. Postgres 16 detects common values even in input arrays with many elements (many more than you have distinct values in status) and switches the plan accordingly. I'm not sure about the old logic in Postgres 12, but above a certain number of elements in the array, Postgres used to switch to generic estimates, which could produce poor results.

    But note that either version can even adapt the plan for prepared statements, based on actual input.

    fiddle — pg 16
    fiddle — pg 12

    More depends on missing details …

  2. It probably just thinks the seq scan will be faster, and for all we know it is correct about that. You can force it to try both plans by changing the setting of "enable_seqscan" and getting the EXPLAIN (ANALYZE, BUFFERS) output under each setting. That way we can see which one is actually faster (run it multiple times each way to make sure you get consistent timings rather than a one-time fluke), and what the estimated and actual row counts are, to see whether those are discordant.
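
    A sketch of that experiment ('somedate' stands in for your actual cutoff, as in the question):

    SET enable_seqscan = off;  -- discourage the sequential scan for this session
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id FROM tbl
    WHERE  created_date < 'somedate'
    AND    status = ANY('{a,b,c,d,e,f,g,h,i,j}');

    RESET enable_seqscan;      -- then run the same EXPLAIN again with the default plan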

    If there is a correlation between the columns, such that rows with those 10 statuses are disproportionately rare among low created_date values, that skew might make it impossible to get good row count estimates. This type of skew is quite likely, given your description of how rows are removed from the table. And none of the currently implemented extended statistics types is likely to fix this type of estimation problem.

    But regardless of that, if you build an index that supports an index-only scan, (status, created_date, id), it should actually be faster and should also be estimated to be faster, so it will likely be used even if the row estimates remain wrong. This is more likely to work than the partial index suggested by Erwin, because (alas) PostgreSQL does not use the size of partial indexes as part of its cost estimation. So even though the partial index might be small, PostgreSQL will not use that knowledge to steer itself toward using it.
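
    A sketch of such an index (the name is arbitrary):

    CREATE INDEX tbl_status_created_date_id_idx
        ON tbl (status, created_date, id);

    Keep in mind that index-only scans also depend on the visibility map, so the table needs to have been vacuumed recently enough for them to pay off.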
