
Background

I’m getting insanely slow queries when using small LIMIT clause values on a relatively simple query.

I’ve already read and re-read PostgreSQL query very slow with limit 1 (I may well have missed something there, but it’s related, so please don’t just point me back at it without a hint about what I might be missing).

I’m definitely encountering essentially the bug referred to there: for small values of LIMIT, this query takes around 7,409,626 µs vs. 53 µs.

Simply changing LIMIT from 1 to 1000 gives me the instantaneous speed; dropping it back to 10 or 1 gives me the OMGBBQ-what-is-wrong-here speed.

I tried applying the basic advice from the above linked SO: add an extra useless ORDER BY column to the query to trick the planner.

However, in my case the query is still extremely slow even if I put the main query, without the limit, in a WITH clause!

The Query

select id 
from rounds 
where userid = (
  select id
  from users
  where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2') 
order by created desc 
limit 1;

Some EXPLAIN ANALYZE results:

Naive query with Limit = 1

explain analyze select id from rounds where userid = (select id from users where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2') order by created desc, userid limit 1;
                 QUERY PLAN
------------------------------------------------
 Limit  (cost=3.07..47.03 rows=1 width=40) (actual time=7408.097..7408.099 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using users_integrationuserid_idx on users  (cost=0.41..2.63 rows=1 width=16) (actual time=0.013..0.014 rows=1 loops=1)
           Index Cond: (integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2'::text)
   ->  Index Scan using recent_rounds_idx on rounds  (cost=0.44..938182.73 rows=21339 width=40) (actual time=7408.096..7408.096 rows=1 loops=1)
         Filter: (userid = $0)
         Rows Removed by Filter: 23123821
 Planning Time: 0.133 ms
 Execution Time: 7408.114 ms
(9 rows)

vs. Limit = 1000 (arbitrary, just to see what happens)

explain analyze select id from rounds where userid = (select id from users where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2') order by created desc, userid limit 1000;
                    QUERY PLAN
------------------------------------------------
 Limit  (cost=24163.47..24165.97 rows=1000 width=40) (actual time=0.048..0.049 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Index Scan using users_integrationuserid_idx on users  (cost=0.41..2.63 rows=1 width=16) (actual time=0.018..0.019 rows=1 loops=1)
           Index Cond: (integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2'::text)
   ->  Sort  (cost=24160.84..24214.18 rows=21339 width=40) (actual time=0.047..0.048 rows=1 loops=1)
         Sort Key: rounds.created DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on rounds  (cost=226.44..22990.84 rows=21339 width=40) (actual time=0.043..0.043 rows=1 loops=1)
               Recheck Cond: (userid = $0)
               Heap Blocks: exact=1
               ->  Bitmap Index Scan on rounds_userid_idx  (cost=0.00..221.10 rows=21339 width=0) (actual time=0.040..0.040 rows=1 loops=1)
                     Index Cond: (userid = $0)
 Planning Time: 0.108 ms
 Execution Time: 0.068 ms
(14 rows)

My basic questions are:

  1. Why is it so omg bad in the first place (in what universe would scanning all rows in the entire db beat scanning a smaller subset after applying the where clause)?
  2. How can I work around this?

AFAICT, I need the planner to first reduce the table to just the rows that match the WHERE clause, and then apply the sort and limit to that subset.

But instead it applies the sort and limit against the entire table (around 23M rows in this case), with unsurprisingly horrible results.

I’ve tried dozens of ways of writing a subquery that extracts the user’s rounds first and only then applies the limit. But the planner sees through every one of them and still walks the full 23M rows.

Additional Attempts / Information

This article indicates that the original answer (to my initial link, above) no longer works as of PostgreSQL 13, and recommends using a CTE instead:

Postgres SLOWER when a LIMIT is set: how to fix besides adding a dummy `ORDER BY`?

However, that was basically my first instinct, and every CTE variant I’ve tried has failed.

One attempt with a CTE: (VERY SLOW!)

with r as (
  select id, created 
  from rounds 
  where userid = (
    select id
    from users 
    where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2')
) 
select r.id from r order by r.created desc limit 1;

Maybe shuffling the ORDER BY and LIMIT around helps? (NO!)

with r as (
  select id, created
  from rounds
  where userid = (
    select id
    from users
    where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2')
  order by created desc
)
select r.id from r limit 1;

Solution (thanks to @jjanes for pointing me to it, and to @denis-de-bernardy for sharing it in the first place):

create index recent_rounds_idx on rounds (userid, created desc);
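
A quick way to sanity-check the fix: re-run the original query under EXPLAIN ANALYZE. With the composite index in place, the plan should show userid in an Index Cond (rather than a Filter that discards millions of rows) and no separate Sort step, since the index already returns the user's rows in created order.

explain analyze
select id
from rounds
where userid = (
  select id
  from users
  where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2')
order by created desc
limit 1;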


Answers


  1. The core problem is the bad estimate for the index scan: 21339 rows estimated, when there is actually only one. ANALYZE the table and see if that is enough to improve the estimate. If not, increase the statistics for the column:

    ALTER TABLE rounds ALTER userid SET STATISTICS 500;
    ANALYZE rounds;
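
    To see whether the estimate improved, a plain EXPLAIN of the query is enough; compare the estimated row count on the rounds scan (21339 in the plan above) with the roughly one row you expect. A minimal sketch, using the query from the question:

    EXPLAIN
    SELECT id
    FROM rounds
    WHERE userid = (
      SELECT id
      FROM users
      WHERE integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2')
    ORDER BY created DESC
    LIMIT 1;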
    

    If you can afford the perfect index for the query, the problem should go away too:

    CREATE INDEX ON rounds (userid, created);
    

    Then you could drop the index `rounds_userid_idx`, because the new index can do everything that that index can (unless it is a unique index).
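
    A minimal sketch of that cleanup (assuming the old index really is named rounds_userid_idx, as in the plan above):

    -- check the old index's definition; if it does not say UNIQUE, it is now redundant
    SELECT indexdef FROM pg_indexes WHERE indexname = 'rounds_userid_idx';

    -- then it can be dropped, since the new (userid, created) index covers the same lookups
    DROP INDEX rounds_userid_idx;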

  2. Why is it so omg bad in the first place (in what universe would scanning all rows in the entire db beat scanning a smaller subset after applying the where clause)?

    The universe where it doesn’t think that that will happen. By traversing the index for rows already in the desired order, it thinks it can stop after finding the first row meeting the required userid, which means it thinks it can stop after reading 1/21340 of the table. This is obviously very wrong, because there is only 1 row meeting that condition, not 21339 of them like it thinks.
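
    To put rough numbers on that (figures taken from the plan above): with an estimated 21,339 matching rows spread evenly through about 23.1M rows, the planner expects to find the first match after reading only about a thousand index entries, while in reality it had to walk essentially the whole index.

    -- rows the planner expects to read before the first match
    select 23123822 / 21340.0 as expected_rows_read;   -- about 1084
    -- rows it actually read (Rows Removed by Filter, plus the one returned)
    select 23123821 + 1 as actual_rows_read;            -- 23123822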

    So, a side question: why is the estimate so very wrong? Since the planner is making a generic estimate here (at planning time it doesn’t know what the value of $0 will turn out to be), it guesses that the number of qualifying rows will be the estimated number of rows in the table divided by the n_distinct estimate for userid (adjusted for null values, which I will ignore for simplicity). So either that estimate is correct, and there are some values of userid with an awful lot of rows (which just happen not to be the values you are looking for) that pull up the average.

    Or it is incorrect and the estimate for n_distinct is way off from reality. We can’t know which from afar. You could figure it out by looking at pg_stats.n_distinct to see the estimate, and comparing it to the true value of count(distinct userid) from rounds. If the estimate is way off, you can fix it manually with alter table rounds alter column userid set (n_distinct = <true value>). You need to ANALYZE the table before this will take effect.

    Grossly wrong n_distinct estimates come from how the sample used to compute statistics is selected (every row is equally likely to be chosen for the sample, but the selections are not strictly independent of each other). Increasing STATISTICS as Laurenz recommends is only weakly able to overcome this, while setting n_distinct tackles it directly.
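
    A minimal sketch of that check and fix (the 50000 below is a hypothetical value; use whatever the count(distinct ...) query actually returns):

    -- what the planner thinks (a negative value means a fraction of the row count)
    select n_distinct from pg_stats where tablename = 'rounds' and attname = 'userid';

    -- the real number of distinct users (can be slow on 23M rows)
    select count(distinct userid) from rounds;

    -- if they disagree badly, pin the value and re-analyze
    alter table rounds alter column userid set (n_distinct = 50000);  -- hypothetical value
    analyze rounds;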

    One attempt with a CTE: (VERY SLOW!)

    Your attempt was very close. But modern PostgreSQL can "see through" the CTE and plan it the same way as it would plan the original query. To prevent that, you need to declare the CTE as being materialized:

    with r as materialized (
      ....
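
    Applied to the CTE attempt from the question, that would look like this (the same query as above, just with the MATERIALIZED keyword added):

    with r as materialized (
      select id, created
      from rounds
      where userid = (
        select id
        from users
        where integrationuserid = 'sample:64ce5bad-8c48-44a4-b473-5a7451980bb2')
    )
    select r.id from r order by r.created desc limit 1;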
    

    This is just by way of explanation, of course, because the multicolumn index is a simple, powerful solution. By giving you the best of both worlds (selectivity on userid and ordering on created at the same time), it removes the need to make fraught decisions about which one is more important, the way the planner has to when you only have the two single-column indexes.
