
Say you want to optimize a query in a Postgres database like:

SELECT DISTINCT ON (first)
    first,
    second,
    third
FROM my_table
WHERE
    second > 100
    AND fourth = 3
ORDER BY
    first,
    second DESC,
    third DESC

(EDIT: In this example, let’s say fourth = 3 matches about 25% of rows, and second > 100 matches only around 5%.)

Here you want to select the first row for each value of first, based on a couple of filter conditions and an ordering over three columns. As far as I know, the best way to do this would be to create an index on (first, second), then an index on (first, second DESC, third DESC). Unfortunately, the second index doesn’t seem to be used when I analyze the query.
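
For concreteness, the indexes I mean would look something like this (a sketch; table and column names follow the example above):

CREATE INDEX ON my_table (first, second);
CREATE INDEX ON my_table (first, second DESC, third DESC);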

Is this the ideal way to create these indexes, or could there be a single index that unifies both the filtering and the sorting?

Secondly, I’m wondering: is there a way to ensure you’re picking the best index strategy given a table, or could this be analytically determined based on your dataset and query?

When I run this now, this is my current output from EXPLAIN ANALYZE:

 Unique  (cost=207985.91..208536.43 rows=18682 width=78) (actual time=823.330..965.769 rows=5248 loops=1)
   ->  Sort  (cost=207985.91..208261.17 rows=110104 width=78) (actual time=823.328..935.933 rows=348232 loops=1)
         Sort Key: first, second DESC, third DESC
         Sort Method: external merge  Disk: 31176kB
         ->  Index Scan using ix_my_table_second_fourth on my_table  (cost=0.44..193872.52 rows=110104 width=78) (actual time=0.017..103.031 rows=348232 loops=1)
               Index Cond: ((fourth = 3) AND (second > 100))
 Planning Time: 0.315 ms
 Execution Time: 971.174 ms

So you can see it uses ix_my_table_second_fourth to filter, but the significant majority of the time is spent sorting, so that the row with the highest second and third values is obtained for each first value.

2 Answers


  1. While DISTINCT ON is the best query technique (and it does not pay to emulate an index-skip scan instead), it should be most efficient to optimize the index for row selection rather than for pre-sorting:

    CREATE INDEX ON my_table (fourth, second DESC NULLS LAST) INCLUDE (first, third);
    

    second > 100 is more selective than fourth = 3, but that hardly matters for the order of index columns, as long as both are in the lead. The deciding factor: equality comes before range predicates. The columns first and third do not contribute to filtering, so they may as well move into the INCLUDE section to make the index a bit smaller.

    Preliminary advice based on incomplete information.

  2. The best index would probably be a partial one, but since the constant that "second" is tested against is apparently not fixed, that wouldn’t be feasible.
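
    For illustration, if the constants were fixed (say fourth = 3 and second > 100, as in the example), such a partial index might look like this sketch:

    create index on my_table (first, second DESC, third DESC) where fourth = 3 and second > 100;

    It would contain only the qualifying rows, already in the order the DISTINCT ON needs, so no sort would be required.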

    To avoid the (unusually expensive) sort, you could have an index which supports the ORDER BY directly. You already report having an index on (first, second DESC, third DESC), which would support the ORDER BY, but that index is needlessly inefficient: it provides no efficient way of removing the wrong values of "fourth", nor does it support an index-only scan. A better index for avoiding the sort would be:

    create index on my_table (fourth, first, second DESC, third DESC)
    

    This gets to jump directly to the correct value for "fourth", and then scan in the correct order within those. It still needs to filter out the wrong values on the "second" condition one by one, but at least it can do that as an index-only scan rather than needing to hop all over the table. (It would of course be better to jump to the correct value of the more selective "second", but since that is an inequality test there is no single correct value; so that is not possible to do while maintaining the ordering property.)

    Based on your existing plan not using a bitmap scan despite returning many thousands of rows from the index, I am assuming your table data is highly correlated on "fourth". By not including "fourth" in your ordering index, you get a double whammy: the index can’t support an index-only scan, and, with "fourth" not being the leading column, the scan doesn’t have good locality and so it has to jump all over the table. (Or at least I suspect the planner thinks it will.)

    If it won’t naturally choose my proposed index, or if you want to force the use of your existing ordering index just to see what would happen if it did, you should be able to do that by running set enable_sort = off in the session before running the query.
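
    For example, a quick session-level experiment might look like this (reset restores the default afterwards):

    set enable_sort = off;
    explain analyze
    select distinct on (first) first, second, third
    from my_table
    where second > 100 and fourth = 3
    order by first, second DESC, third DESC;
    reset enable_sort;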

    Secondly, I’m wondering: is there a way to ensure you’re picking the best index strategy given a table, or could this be analytically determined based on your dataset and query?

    The planner tries to pick the best index to use. Of course, there is no guarantee it succeeds; not in general, and especially not when your row estimates are substantially off.

    If the index I propose is not good enough, the last resort may be to partition the table on "second". Then it could skip processing the partitions that can’t meet the inequality, while still using the ordering index on the remaining partitions with a Merge Append to combine them.
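
    A sketch of what that might look like, with an illustrative boundary at 100 (the names are placeholders, and real boundaries would have to suit the range of constants you actually query with):

    create table my_table_part (like my_table) partition by range (second);
    create table my_table_low partition of my_table_part for values from (minvalue) to (100);
    create table my_table_high partition of my_table_part for values from (100) to (maxvalue);
    -- an index created on the partitioned table is created on each partition as well
    create index on my_table_part (fourth, first, second DESC, third DESC);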
