I am struggling with a slow query in Postgres.
Fixtures
Consider the following fixtures:
DROP TABLE IF EXISTS expectation;
DROP TABLE IF EXISTS actual;
CREATE TABLE expectation (
set_id int NOT NULL,
value int NOT NULL
);
INSERT INTO expectation (set_id, value)
SELECT floor(random() * 1000)::int AS set_id, floor(random() * 1000)::int AS value FROM generate_series(1, 2000);
CREATE TABLE actual (
user_id int NOT NULL,
value int NOT NULL
);
INSERT INTO actual (user_id, value)
SELECT floor(random() * 200000)::int AS user_id, floor(random() * 1000)::int AS value FROM generate_series(1, 1000000);
Feature
We have a table of expectations which represents a list of values and the corresponding set_id. A set_id can have several values.
# SELECT * FROM "expectation" ORDER BY "set_id" LIMIT 10;
set_id | value
--------+-------
0 | 641
1 | 560
2 | 872
3 | 56
3 | 608
4 | 652
5 | 439
5 | 145
6 | 510
6 | 515
And we have a table of actual data that assigns values to a user. A user_id can have several values too.
# SELECT * FROM "actual" ORDER BY "user_id" LIMIT 10;
user_id | value
---------+-------
0 | 128
0 | 177
0 | 591
0 | 219
0 | 785
0 | 837
0 | 782
1 | 502
1 | 521
1 | 210
Problem
Now we need to get all users together with all the set_id for which they have all the values. In other words, a user must have all the values of a set (and possibly more) to match it.
My solution is:
# WITH
expected AS (SELECT set_id, array_agg(value) as values FROM expectation GROUP BY set_id),
gotten AS (SELECT user_id, array_agg(value) as values FROM actual GROUP BY user_id)
SELECT user_id, array_agg(set_id) FROM gotten
INNER JOIN expected ON expected.values <@ gotten.values
GROUP BY user_id LIMIT 10;
user_id | array_agg
---------+-------------------------
0 | {525}
1 | {175,840}
2 | {336}
3 | {98,260}
7 | {416}
8 | {2,251,261,352,682,808}
9 | {971}
10 | {163,485}
11 | {793}
12 | {157,332,539,582,617}
(10 rows)
Time: 18960.143 ms (00:18.960)
It returns the expected result, but it takes way too long: ~18 seconds for the given fixture.
Already explored
- Note that, because of the aggregation, limiting the query does not save time: the LIMIT only applies after the grouping.
- A materialized, indexed view of the result could help, but the data change frequently in my app and I'm not sure it's a solution for me.
- The query plan looks fair to me; I can't see how I could index anything. The Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value))) is slow, but I cannot see a better way to do the containment check.
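The plan below is the EXPLAIN (ANALYZE) output for the full query without the LIMIT, roughly like this (the exact invocation is not shown here):

EXPLAIN (ANALYZE)
WITH
expected AS (SELECT set_id, array_agg(value) AS values FROM expectation GROUP BY set_id),
gotten AS (SELECT user_id, array_agg(value) AS values FROM actual GROUP BY user_id)
SELECT user_id, array_agg(set_id) FROM gotten
INNER JOIN expected ON expected.values <@ gotten.values
GROUP BY user_id;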
GroupAggregate (cost=127896.00..2339483.85 rows=200 width=36) (actual time=502.126..23381.752 rows=139712 loops=1)
  Group Key: actual.user_id
  -> Nested Loop (cost=127896.00..2335820.38 rows=732194 width=8) (actual time=501.614..23332.035 rows=277930 loops=1)
       Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value)))
       Rows Removed by Join Filter: 171755568
       -> GroupAggregate (cost=127757.34..137371.07 rows=169098 width=36) (actual time=500.499..762.447 rows=198653 loops=1)
            Group Key: actual.user_id
            -> Sort (cost=127757.34..130257.34 rows=1000000 width=8) (actual time=329.909..476.859 rows=1000000 loops=1)
                 Sort Key: actual.user_id
                 Sort Method: external merge Disk: 17696kB
                 -> Seq Scan on actual (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.014..41.334 rows=1000000 loops=1)
       -> Materialize (cost=138.66..177.47 rows=866 width=36) (actual time=0.000..0.019 rows=866 loops=198653)
            -> GroupAggregate (cost=138.66..164.48 rows=866 width=36) (actual time=0.551..1.164 rows=866 loops=1)
                 Group Key: expectation.set_id
                 -> Sort (cost=138.66..143.66 rows=2000 width=8) (actual time=0.538..0.652 rows=2000 loops=1)
                      Sort Key: expectation.set_id
                      Sort Method: quicksort Memory: 142kB
                      -> Seq Scan on expectation (cost=0.00..29.00 rows=2000 width=8) (actual time=0.020..0.146 rows=2000 loops=1)
Planning Time: 0.243 ms
JIT:
  Functions: 17
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.831 ms, Inlining 43.440 ms, Optimization 61.965 ms, Emission 64.892 ms, Total 172.129 ms
Execution Time: 23406.950 ms
Answers
Create an index. This gave me this query plan: running on PostgreSQL version 16, 9 milliseconds in total.
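The exact DDL is not shown here. As an illustration only (not necessarily the index from this answer), such indexes might look like:

-- illustrative only; the answer's actual DDL is not shown here
CREATE INDEX actual_user_id_value_idx ON actual (user_id, value);
CREATE INDEX expectation_set_id_value_idx ON expectation (set_id, value);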
That’s not necessarily true. If you build sorted arrays (or if you add an index like Frank demonstrated), Postgres picks a different query plan where a small LIMIT is much faster. But that hardly helps for a full set without LIMIT; indexes cannot help much, either. No query for the full set can be very fast, but a query with a recursive CTE can use indexes to be at least 10x faster than what you have now. At its core, it's a dynamic case of relational division.
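As a sketch of the "build sorted arrays" idea mentioned above (not the exact query from the answer), the original query can simply aggregate each array in sorted order:

WITH
expected AS (SELECT set_id, array_agg(value ORDER BY value) AS values FROM expectation GROUP BY set_id),
gotten AS (SELECT user_id, array_agg(value ORDER BY value) AS values FROM actual GROUP BY user_id)
SELECT user_id, array_agg(set_id) FROM gotten
INNER JOIN expected ON expected.values <@ gotten.values
GROUP BY user_id LIMIT 10;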
The recursive query works after cleaning up the sample data and adding PRIMARY KEY constraints on expectation (set_id, value) and actual (user_id, value): fiddle
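The answer's actual query is only in the linked fiddle. A rough sketch of the recursive approach (a reconstruction under those PRIMARY KEY assumptions, not the exact query from the answer): step through each set's values in order and keep only the users that have every one of them.

WITH RECURSIVE ranked AS (
   -- number the values of each set so they can be checked one by one
   SELECT set_id, value,
          row_number() OVER (PARTITION BY set_id ORDER BY value) AS rn,
          count(*)     OVER (PARTITION BY set_id)                AS ct
   FROM   expectation
), rec AS (
   -- users that have the first value of a set
   SELECT a.user_id, r.set_id, r.rn, r.ct
   FROM   ranked r
   JOIN   actual a USING (value)
   WHERE  r.rn = 1
   UNION ALL
   -- keep only users that also have the next value of the same set
   SELECT rec.user_id, rec.set_id, r.rn, r.ct
   FROM   rec
   JOIN   ranked r ON r.set_id = rec.set_id AND r.rn = rec.rn + 1
   JOIN   actual a ON a.user_id = rec.user_id AND a.value = r.value
)
SELECT user_id, array_agg(set_id) AS set_ids
FROM   rec
WHERE  rn = ct        -- the user matched every value of the set
GROUP  BY user_id
LIMIT  10;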