skip to Main Content

We are trying to optimize a query to a partitioned table, the query looks something like this:

SELECT col1, col2
FROM partitioned_table
WHERE profile_id = '00000000-0000-0000-0000-000000000000'
AND product_id = 'product_a'
ORDER BY created_at DESC
LIMIT 500;

Parent/partitioned table definition:

CREATE TABLE public.partitioned_table (
    trade_id integer NOT NULL,
    product_id character varying NOT NULL,
    settled boolean DEFAULT false NOT NULL,
    user_id public.mongo_id NOT NULL,
    profile_id uuid NOT NULL,
    created_at timestamp with time zone NOT NULL
)
PARTITION BY RANGE (created_at);

Index used to scan:

CREATE INDEX partitioned_profile_id_product_id_trade_id_idx ON ONLY public.partitioned_table USING btree (profile_id, product_id, trade_id) INCLUDE (created_at);

The definition is on the partitioned table, partitions themselves were created after the index is added to the partitioned table, so they have the same set of indexes.

Each partition contains a day worth of data, about 12 million rows.
We run Postgres 14.5 on AWS RDS.

This is the query plan:

                                                       QUERY PLAN
----------------------------------------------------------------------------------------------
 Limit  (cost=944.59..945.84 rows=500 width=202) (actual time=39.501..39.691 rows=500 loops=1)
   ->  Sort  (cost=944.59..947.09 rows=1000 width=202) (actual time=39.499..39.660 rows=500 loops=1)
         Sort Key: partitioned_table.created_at DESC
         Sort Method: top-N heapsort  Memory: 290kB
         ->  Append  (cost=0.71..894.76 rows=1000 width=202) (actual time=0.030..27.204 rows=32867 loops=1)
               ->  Index Scan using partitioned_table_profile_id_product_id_trade_id_idx on partitioned_table_legacy partitioned_table_1  (cost=0.71..772.99 rows=379 width=116) (actual time=0.029..22.550 rows=32838 loops=1)
                     Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
               ->  Index Scan using partition_20240601_profile_id_product_id_trade_id_created_idx on partition_20240601 partitioned_table_2  (cost=0.56..12.65 rows=5 width=117) (actual time=0.019..0.019 rows=0 loops=1)
                     Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
               ->  Index Scan using partition_20240602_profile_id_product_id_trade_id_created_idx on partition_20240602 partitioned_table_3  (cost=0.56..12.65 rows=5 width=117) (actual time=0.011..0.011 rows=0 loops=1)
                     Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
               ->  Index Scan using partition_20240603_profile_id_product_id_trade_id_created_idx on partition_20240603 partitioned_table_4  (cost=0.56..18.68 rows=8 width=117) (actual time=0.014..0.017 rows=3 loops=1)
                     Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
               ->  Index Scan using partition_20240604_profile_id_product_id_trade_id_created_idx on partition_20240604 partitioned_table_5  (cost=0.56..4.58 rows=1 width=117) (actual time=0.013..0.014 rows=2 loops=1)
                     Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
               ->  Index Scan using partition_20240605_profile_id_product_id_trade_id_created_idx on partition_20240605 partitioned_table_6  (cost=0.56..16.66 rows=7 width=117) (actual time=0.020..0.021 rows=2 loops=1)
                     Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
               ->  Index Scan using partition_20240606_profile_id_product_id_trade_id_created_idx on partition_20240606 partitioned_table_7  (cost=0.56..14.67 rows=6 width=117) (actual time=0.013..0.014 rows=1 loops=1)
                     Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
               ->  Index Scan using partition_20240607_profile_id_product_id_trade_id_created_idx on partition_20240607 partitioned_table_8  (cost=0.56..36.90 rows=17 width=117) (actual time=0.015..0.037 rows=21 loops=1)
                     Index Cond: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
               ->  Seq Scan on partition_20240608 partitioned_table_9  (cost=0.00..0.00 rows=1 width=265) (actual time=0.014..0.015 rows=0 loops=1)
                     Filter: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
               ->  Seq Scan on partition_20240609 partitioned_table_10  (cost=0.00..0.00 rows=1 width=265) (actual time=0.004..0.004 rows=0 loops=1)
                     Filter: ((profile_id = '00000000-0000-0000-0000-000000000000'::uuid) AND ((product_id)::text = 'product_a'::text))
...

Query and query plan obfuscated, and the query plan continues on and does sequential scan for all future/empty partitions.

Upon observing the query plan I have two questions:

  1. Although we specified ORDER BY created_at DESC , query plan still scans the partitions in forward chronological order, can it be reversed since it is a backward sort ?

  2. We pro-actively created two years worth of future partitions to reduce operational cost. However, since this query does not have the partitioning column created_at in the WHERE clause, it is scanning all the future/empty partitions even after enough records is fetched, basically ignoring the LIMIT clause. How to make it stop scanning when it finds enough records?

I’ve been mostly reading docs, have not been able to find much insight.

2

Answers


  1. Bad query plan

    We want a plan with Merge Append, and partitions listed in (descending) order of their partition key, and Postgres stopping the scan as soon as LIMIT is satisfied. Like here:

    But we actually see AppendSortLimit

    The release notes for Postgres 15 have this juicy item:

    • Allow ordered scans of partitions to avoid sorting in more cases (David Rowley)

      Previously, a partitioned table with a DEFAULT partition or a
      LIST partition containing multiple values could not be used for
      ordered partition scans. Now they can be used if such partitions are
      pruned during planning.

    (There were many other improvements, so upgrading to a current version should help in any case!)

    Indeed, it would seem you have such a default partition:

    ->  Index Scan using partitioned_table_profile_id_product_id_trade_id_idx
        on partitioned_table_legacy partitioned_table_1
        (cost=0.71..772.99 rows=379 width=116)
        (actual time=0.029..22.550 rows=32838 loops=1)
    

    Bold emphasis mine.
    You don’t tell us, but the deviating table name indicates as much.

    You would still have to exclude the default partition in current Postgres to make it work. (And merge it manually.) But we have an explanation why it does not work in the first place.

    The second item of interest here is the bad estimate (not the core problem). Postgres expects 379 rows, but finds 32838. Estimates for combined filters are notoriously hard, but that’s still bad.

    • Are your column statistics up to date? Run ANALYZE partitioned_table_legacy and test again.

    • Increasing the STATISTIC target might help. See:

    • But for combined filters, you may really need extended statistics like:

      CREATE STATISTICS (mcv) ON profile_id, product_id FROM public.partitioned_table;
      

    Related:

    Manual intervention

    If all else fails, you have to do it manually. I would write a PL/pgSQL function that iterates through partitions with dynamic SQL and adaptive LIMIT. That’s beyond my limits for unpaid work. Related:

    Better index

    Once you achieve an ordered partition scan (with upgraded Postgres and/or improved query) this index would be ideal:

    CREATE INDEX ON public.partitioned_table (profile_id, product_id, created_at DESC);
    

    If col1, col2 are small columns, you might attach them to the index with an INCLUDE clause to allow index-only scans.

    Do not include the ONLY keyword at this time, since we want that index created for every existing partition, too.


    it is scanning all the future/empty partitions even after enough records is fetched,

    Your query says ORDER BY created_at DESC. So those future partitions come first and have to be scanned in any case. If you want to exclude future rows, add a WHERE clause accordingly.

    Login or Signup to reply.
  2. Scanning the partitions in the correct order (and so using an Append with no Sort either above or below the Append) is only done where there is an index which allows to the rows to be returned already in the desired order within each partition.

    I don’t know why the decision is made this way, or where in the code it is implemented. But on the face of it, it seems like a dumb way to do it. When there is no ordering index (or there is one but it is a bad choice for other reasons) is exactly when such an optimization would be most useful–it would let you sort the rows of only one partition (or however many it took to find the LIMIT) rather than sorting rows from all of them.

    In any case, to your issue at hand. A very good index for this would be on (profile_id, product_id, created_at) as then it could apply the equality test to the first two columns and then immediately return rows ordered by the third one. This would both be faster within each partition, as well as more efficient at stopping early before visiting unneeded partitions.

    A simpler index just on (created_at) would also be enough to invoke the ordered-partition optimization, but then it would be left to filter out rows not matching the desired profile_id and product_id in an inefficient way, and so is unlikely to be the better method if most rows need to be filtered.

    We pro-actively created two years worth of future partitions to reduce
    operational cost. However, since this query does not have the
    partitioning column created_at in the WHERE clause, it is scanning all
    the future/empty partitions even after enough records is fetched,
    basically ignoring the LIMIT clause. How to make it stop scanning when
    it finds enough records?

    In your plan, the scanning of those future partitions takes essentially no time, so why does it matter that they are scanned? Anyway, should any of those not really be empty then the rows they did have would be the first ones returned, so they can hardly be skipped as unnecessary as that would mean returning the wrong data.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search