
I have a Postgres table that has numerous columns which frequently show up in the where clause of select queries. The table has been indexed accordingly, with indexes on all of these columns (mostly single-column indexes, but with some composite indexes thrown in). However, there is one new kind of query that this indexing isn’t fully supporting: queries with a deleted_at is null condition in the where clause (we soft-delete records using this column). Some queries with this are running very slowly despite all of the other columns they use being indexed. Naturally, I want to find a way to improve these queries with a change to our indexing.

An example would be:

select count(distinct user_id)
from my_table
where group_id = '123' and deleted_at is null

In this example, both user_id and group_id are indexed. Without the and deleted_at is null condition, the query runs quickly; with it, slowly.

I have four competing solutions. I plan on testing them, but I really want to know if any old hands can look at this situation and give a simple explanation for why one should be expected to perform better than the others. I’m just getting the hang of thinking about indexing after being spoiled by Snowflake for years, so I’m really looking for how one would reason about this.

My solutions:

  1. An index on the expression (docs) deleted_at is null. Basically: CREATE INDEX deleted_at_is_null ON my_table ((deleted_at is null));. This is the simplest solution. It’s just one more index, and one with a clear purpose. I’m not sure, though, if it should actually be expected to help in queries where we have other indexed columns in the where clause! Can Postgres use them separately or do they need to be composite?
  2. Replace each of the current indexes (like the ones on user_id and group_id above) with composite indexes on that column plus deleted_at is null (a sketch of this appears after the list).
  3. Same as 2, but instead of replacing the indexes, add the composite indexes in addition to the currently-existing indexes. This feels wrong and redundant, but I am not sure.
  4. Add a new partial index for each of the currently-existing indexes, with a where deleted_at is null condition (also sketched after the list). Like number 3, this feels like too many indexes.
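
To make options 2 and 4 concrete, here’s roughly what each would look like for the group_id index (the index names are made up for illustration):

-- Option 2: composite index on the column plus the is-null expression
create index group_id_deleted_at_is_null on my_table (group_id, (deleted_at is null));

-- Option 4: partial index containing only the non-deleted rows
create index group_id_active on my_table (group_id) where deleted_at is null;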

I’m assuming that an index on deleted_at itself is overkill since I never need to query for specific ranges/values of it – only for whether it is null. Please correct me if I am wrong, though!

One other thing to note is that the vast majority of the records have null deleted_at.

Any help would be much appreciated! Just looking for some intuition and best practices around this problem.

2 Answers


  1. PostgreSQL will generally use only one index per table scan (it can combine several with a bitmap scan, but that is often slower than one well-matched index). If you have single-column indexes, it must effectively choose only one. In your example, the query planner has to choose whether using the user_id, group_id, or deleted_at index will be most performant. Which it chooses depends on the shape of your data, and on whether your table statistics are up to date (run analyze my_table to make sure).

    For example, if half the rows are deleted, using an index on deleted_at would only cut the number of rows to search in half. But if only a small fraction of rows are in group 123, it will choose the index on group_id and then scan those rows for deleted_at is null and distinct user_id.
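
    To see which index the planner actually picks for a given query, inspect the plan; for the question’s query, something like:

    explain (analyze, buffers)
    select count(distinct user_id)
    from my_table
    where group_id = '123' and deleted_at is null;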


    You can be more efficient about creating indexes for multiple columns by taking advantage of how composite indexes work. An index on (a, b, c) can cover queries which include a, queries which include a and b, and queries which include a, b, and c. It cannot cover queries which include only b or c.

    For example, to cover every combination of deleted_at, group_id, and user_id you’d need three indexes (create statements are sketched after the list):

    • (deleted_at, user_id, group_id)
      • covers deleted_at, deleted_at + user_id, and deleted_at + user_id + group_id
    • (group_id, deleted_at)
      • covers group_id, group_id + deleted_at
    • (user_id, group_id)
      • covers user_id, user_id + group_id
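
    In create index form (index names invented for illustration), those three would be:

    create index del_user_group_idx on my_table (deleted_at, user_id, group_id);
    create index group_del_idx on my_table (group_id, deleted_at);
    create index user_group_idx on my_table (user_id, group_id);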

    But you said you have a lot of columns, so the number of indexes can expand rapidly. And since where deleted_at is null is likely to be used in most queries, it makes more sense to partition your table by deleted_at is null. This will create two tables which appear to be one: one table holds the deleted rows, the other the active rows. If you include where deleted_at is null in a query, PostgreSQL will simply query the appropriate partition, leaving it free to choose indexes for the other columns. It also makes it more efficient to remove "deleted" rows without blocking other queries.

    You can’t partition an existing table, so you have to make a new table, partition it, and copy your data over.

    The downside is that if you have a primary key, the partition key has to be part of it. And nulls aren’t allowed in primary keys, so you’d have to change your strategy to use a special sentinel date like 9999-01-01. For convenience and safety, create a view which only selects non-deleted rows.

    -- Move the old table out of the way.
    alter table things rename to things_original;
    
    -- Change the undeleted rows.
    update things_original
    set deleted_at = '9999-01-01' 
    where deleted_at is null;
    
    -- Create the new partitioned table
    -- deleted_at must be part of the primary key
    -- this allows IDs to be reused, which may or may not be what you want
    create table things_all (
      id serial,
      name text not null,
      deleted_at timestamp default '9999-01-01',
      primary key(id, deleted_at)
    ) partition by list(deleted_at);
    
    -- Make a partition for the active rows.
    create table things_active partition of things_all for values in ('9999-01-01');
    
    -- And one for the inactive rows.
    create table things_deleted partition of things_all default;
    
    -- Copy rows from the old table. They'll automatically be partitioned.
    insert into things_all select * from things_original;
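    
    -- Note: the insert above supplies explicit ids, which does not advance the
    -- id serial sequence; reset it so new inserts don't collide with copied rows.
    select setval(pg_get_serial_sequence('things_all', 'id'),
                  (select max(id) from things_all));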
    
    -- Add your indexes to things_all (not shown)
    
    -- Create a view to only get active rows.
    create view things as 
    select * 
    from things_all 
    where deleted_at = '9999-01-01';
    
    -- This will only query the things_active table, no deleted_at index required.
    select * from things;
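    
    -- Soft-deleting a row is now an update; with PostgreSQL 11+ row movement the
    -- updated row migrates to things_deleted automatically (id 1 is an example):
    update things_all set deleted_at = now() where id = 1;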
    

    Demonstration

  2. Not really an answer, just an addition to your list:

    1. Replace the indexes you have with their partial versions, if your queries always want the (deleted_at is null) condition. Otherwise, a query that mismatches the predicate can’t use the partial index at all.
    2. Don’t add (deleted_at is null) as either a key column or a partial index predicate, but rather strap deleted_at on as payload using include.

    The former is a missing combination of the options you already established; the latter works somewhat against what the documentation says about non-key column inclusion:

    A non-key column cannot be used in an index scan search qualification.

    And indeed it is not used in the qualification, but it is used in the scan, speeding things up by saving a whole subsequent heap fetch. If you only add deleted_at as payload, Postgres still prefers a plain index on group_id followed by a re-check on the heap, because it needs to consult both deleted_at and the user_id values it’s looking for.
    If you add both as payload:

    create index idx2 on my_table (group_id) include (user_id, deleted_at);
    

    Everything is in the index. Now Postgres sees deleted_at is in the index it’s already using, so both the output and the filter can re-use that:
    demo at db<>fiddle

    QUERY PLAN
    Aggregate  (cost=4.50..4.51 rows=1 width=8) (actual time=0.071..0.072 rows=1 loops=1)
      Output: count(DISTINCT user_id)
      ->  Index Only Scan using idx2 on public.my_table  (cost=0.42..4.49 rows=1 width=4) (actual time=0.055..0.057 rows=2 loops=1)
            Output: group_id, user_id, deleted_at
            Index Cond: (my_table.group_id = 123)
            Filter: (my_table.deleted_at IS NULL)
            Rows Removed by Filter: 2
            Heap Fetches: 0
    Planning Time: 0.108 ms
    Execution Time: 0.095 ms

    That’s on 100k random group_id and user_id values spread over 300k rows, with 20% of rows having deleted_at IS NULL.


    1. While an inert payload column keeps the index simpler to traverse than adding more key columns, include doesn’t support expressions, so the index might actually end up larger than a version with the expression in second key position: for rows where deleted_at is not null, the whole timestamp gets pulled into the index rather than a single boolean (a sketch of that alternative follows this list).
    2. This sort of configuration only works if you set up these indexes to contain everything you’re querying, making them covering indexes. That might be a problem with many columns, especially wide ones, or even with a few columns if they’re wide enough.
    3. Again, this was meant as just an addition to the list.
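
    For reference, the alternative mentioned in point 1, with the boolean expression as a second key column (storing one byte per row instead of a full timestamp) and user_id as payload, could look like this (idx3 is a made-up name):

    create index idx3 on my_table (group_id, (deleted_at is null)) include (user_id);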