I sometimes find myself needing to select many rows from different tables based on a list of IDs or other values. Since this list can contain thousands of elements, it is important to me that the filter does not turn into an n*m operation. I have experienced that under some circumstances Postgres does fast lookups and sometimes not, for example when I filter by primary key. When the filtering was slow, I typically solved it by building a query table out of the IN list and joining it with the target table to do the filtering.
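
The workaround I mean looks roughly like this (just a sketch; the items table and its item_id column are placeholder names):

    -- instead of: SELECT * FROM items WHERE item_id IN (1, 2, 3, ...)
    -- build a value list and join it with the table to do the filtering
    SELECT i.*
    FROM items i
    JOIN (VALUES (1), (2), (3) /* , thousands more */) AS ids(item_id)
      ON ids.item_id = i.item_id;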

But now I’m facing the problem again, and I would really like to hear if somebody can help me build some intuition for when I should assume a fast lookup is used and when a hash table is used.

At my current job we are using Postgres 15, but most of my experience is from Postgres 10, so I may have some bad habits left over from older versions.

2 Answers

  1. Why assume, when you can EXPLAIN ANALYZE VERBOSE? It’s hard to comment on why IN worked so well without seeing examples of what it replaced and how.

    Even if your intuition is good enough to trust, you’re still better off if you trust but verify. Server configuration, index setup, table size, traffic, and how often the table is vacuum-analyzed can all have an impact on what the planner comes up with, and it can change from version to version. Human intuition in this context is doomed to be unreliable and to get outdated quickly.

    That being said, here are some 1M-row tests at db<>fiddle, here are some comparisons of IN, ANY, EXISTS and JOIN behaviour, and here are some more, more visual ones.
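
    As a minimal sketch (my_table and its id column are placeholders), comparing the two forms against your own data could look like this:

    -- plan for a plain IN list
    EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
    SELECT * FROM my_table
    WHERE id IN (1, 2, 3 /* , ... */);

    -- plan for a join against a VALUES list
    EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
    SELECT t.*
    FROM my_table t
    JOIN (VALUES (1), (2), (3) /* , ... */) AS v(id) ON v.id = t.id;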

  2. At some point I was using column = ANY (VALUES (...), ...) instead of making N batched requests with IN, because it was faster (basically it builds a temporary value list and joins against it), but it looks like in Postgres 15 that’s no longer the case: a regular IN, even with 2000 values, runs about 2x faster than the value-list filter. (Though maybe at some N the value list is still faster than IN.)

    Having a sku_test table set up as

    CREATE TABLE sku_test (
        code integer PRIMARY KEY
    );
    

    and with around 22k records

    EXPLAIN ANALYZE 
    SELECT * FROM sku_test
    WHERE code IN (64309, 64310, ...);
    

    results in

    Seq Scan on sku_test  (cost=5.00..450.03 rows=2000 width=4) (actual time=0.343..13.693 rows=2000 loops=1)
      Filter: (code = ANY ('{64309, 64310, ...}'::integer[]))
      Rows Removed by Filter: 20869
    Planning Time: 7.619 ms
    Execution Time: 14.275 ms
    

    and

    EXPLAIN ANALYZE 
    SELECT * FROM sku_test
    WHERE code = ANY(VALUES (64309), (64310), ...);
    

    results in

    Hash Semi Join  (cost=50.00..462.97 rows=2000 width=4) (actual time=4.169..28.998 rows=2000 loops=1)
      Hash Cond: (sku_test.code = "*VALUES*".column1)
      ->  Seq Scan on sku_test  (cost=0.00..330.69 rows=22869 width=4) (actual time=0.055..10.293 rows=22869 loops=1)
      ->  Hash  (cost=25.00..25.00 rows=2000 width=4) (actual time=4.033..4.040 rows=2000 loops=1)
            Buckets: 2048  Batches: 1  Memory Usage: 87kB
            ->  Values Scan on "*VALUES*"  (cost=0.00..25.00 rows=2000 width=4) (actual time=0.010..2.344 rows=2000 loops=1)
    Planning Time: 6.780 ms
    Execution Time: 29.746 ms
    

    And I see similar results on a production database, with a live table and an index on the filtering column.

    But again, it’s always better to measure for yourself.
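
    Side note: in the first plan above, Postgres rewrites the IN list into code = ANY ('{...}'::integer[]). If the values come from application code, you can also pass them as a single array parameter and write that form directly (a sketch, assuming an integer column and a driver that can bind arrays):

    -- $1 is bound to one integer array instead of thousands of separate parameters
    SELECT * FROM sku_test
    WHERE code = ANY ($1::integer[]);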
