I am struggling with a slow query in Postgres.
Fixtures
Consider the following fixtures:
DROP TABLE IF EXISTS expectation;
DROP TABLE IF EXISTS actual;
CREATE TABLE expectation (
set_id int NOT NULL,
value int NOT NULL
);
INSERT INTO expectation (set_id, value)
SELECT floor(random() * 1000)::int AS set_id, floor(random() * 1000)::int AS value FROM generate_series(1, 2000);
CREATE TABLE actual (
user_id int NOT NULL,
value int NOT NULL
);
INSERT INTO actual (user_id, value)
SELECT floor(random() * 200000)::int AS user_id, floor(random() * 1000)::int AS value FROM generate_series(1, 1000000);
Feature
We have a table of expectations which represents a list of values and the corresponding set_id. A set_id can have several values.
# SELECT * FROM "expectation" ORDER BY "set_id" LIMIT 10;
set_id | value
--------+-------
0 | 641
1 | 560
2 | 872
3 | 56
3 | 608
4 | 652
5 | 439
5 | 145
6 | 510
6 | 515
And we have a table of actual data that assigns values to a user. A user_id can have several values too.
# SELECT * FROM "actual" ORDER BY "user_id" LIMIT 10;
user_id | value
---------+-------
0 | 128
0 | 177
0 | 591
0 | 219
0 | 785
0 | 837
0 | 782
1 | 502
1 | 521
1 | 210
Problem
Now we need to get all users together with all the set_id for which they have all the values. In other words, a user must have all the values of a set (and possibly more) to match it.
My solution is:
# WITH
expected AS (SELECT set_id, array_agg(value) as values FROM expectation GROUP BY set_id),
gotten AS (SELECT user_id, array_agg(value) as values FROM actual GROUP BY user_id)
SELECT user_id, array_agg(set_id) FROM gotten
INNER JOIN expected ON expected.values <@ gotten.values
GROUP BY user_id LIMIT 10;
user_id | array_agg
---------+-------------------------
0 | {525}
1 | {175,840}
2 | {336}
3 | {98,260}
7 | {416}
8 | {2,251,261,352,682,808}
9 | {971}
10 | {163,485}
11 | {793}
12 | {157,332,539,582,617}
(10 rows)
Time: 18960.143 ms (00:18.960)
It returns the expected result, but it takes way too long: ~18 seconds for the given fixture.
Already explored
- Note that, because of the aggregation, limiting the query does not save time: the LIMIT only applies after the grouping.
- A materialized, indexed view of the result could help, but the data change frequently in my app and I'm not sure it's a solution for me.
- The query plan looks fair to me; I can't see how I could index anything. The Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value))) is slow, but I cannot see a better way to do the containment check.
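The plan below is the EXPLAIN (ANALYZE) output for the full query without the LIMIT, roughly like this (the exact invocation is not shown here):

EXPLAIN (ANALYZE)
WITH
expected AS (SELECT set_id, array_agg(value) AS values FROM expectation GROUP BY set_id),
gotten AS (SELECT user_id, array_agg(value) AS values FROM actual GROUP BY user_id)
SELECT user_id, array_agg(set_id) FROM gotten
INNER JOIN expected ON expected.values <@ gotten.values
GROUP BY user_id;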
GroupAggregate (cost=127896.00..2339483.85 rows=200 width=36) (actual time=502.126..23381.752 rows=139712 loops=1)
  Group Key: actual.user_id
  -> Nested Loop (cost=127896.00..2335820.38 rows=732194 width=8) (actual time=501.614..23332.035 rows=277930 loops=1)
       Join Filter: ((array_agg(expectation.value)) <@ (array_agg(actual.value)))
       Rows Removed by Join Filter: 171755568
       -> GroupAggregate (cost=127757.34..137371.07 rows=169098 width=36) (actual time=500.499..762.447 rows=198653 loops=1)
            Group Key: actual.user_id
            -> Sort (cost=127757.34..130257.34 rows=1000000 width=8) (actual time=329.909..476.859 rows=1000000 loops=1)
                 Sort Key: actual.user_id
                 Sort Method: external merge Disk: 17696kB
                 -> Seq Scan on actual (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.014..41.334 rows=1000000 loops=1)
       -> Materialize (cost=138.66..177.47 rows=866 width=36) (actual time=0.000..0.019 rows=866 loops=198653)
            -> GroupAggregate (cost=138.66..164.48 rows=866 width=36) (actual time=0.551..1.164 rows=866 loops=1)
                 Group Key: expectation.set_id
                 -> Sort (cost=138.66..143.66 rows=2000 width=8) (actual time=0.538..0.652 rows=2000 loops=1)
                      Sort Key: expectation.set_id
                      Sort Method: quicksort Memory: 142kB
                      -> Seq Scan on expectation (cost=0.00..29.00 rows=2000 width=8) (actual time=0.020..0.146 rows=2000 loops=1)
Planning Time: 0.243 ms
JIT:
  Functions: 17
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.831 ms, Inlining 43.440 ms, Optimization 61.965 ms, Emission 64.892 ms, Total 172.129 ms
Execution Time: 23406.950 ms
Answers
Create an index. This gave me this query plan: running on PostgreSQL version 16, 9 milliseconds in total.
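The exact DDL is not shown here. As an illustration only (not necessarily the index from this answer), such indexes might look like:

-- illustrative only; the answer's actual DDL is not shown here
CREATE INDEX actual_user_id_value_idx ON actual (user_id, value);
CREATE INDEX expectation_set_id_value_idx ON expectation (set_id, value);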
That’s not necessarily true. If you build sorted arrays (or if you add an index like Frank demonstrated), Postgres picks a different query plan where a small LIMIT is much faster. But that hardly helps for a full set without LIMIT; indexes cannot help much, either. No query for the full set can be very fast, but a query with a recursive CTE can use indexes to be at least 10x faster than what you have now. At its core, it's a dynamic case of relational division.
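As a sketch of the "build sorted arrays" idea mentioned above (not the exact query from the answer), the original query can simply aggregate each array in sorted order:

WITH
expected AS (SELECT set_id, array_agg(value ORDER BY value) AS values FROM expectation GROUP BY set_id),
gotten AS (SELECT user_id, array_agg(value ORDER BY value) AS values FROM actual GROUP BY user_id)
SELECT user_id, array_agg(set_id) FROM gotten
INNER JOIN expected ON expected.values <@ gotten.values
GROUP BY user_id LIMIT 10;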
The recursive query works after cleaning up the sample data and adding PRIMARY KEY constraints on expectation (set_id, value) and actual (user_id, value): fiddle
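The answer's actual query is only in the linked fiddle. A rough sketch of the recursive approach (a reconstruction under those PRIMARY KEY assumptions, not the exact query from the answer): step through each set's values in order and keep only the users that have every one of them.

WITH RECURSIVE ranked AS (
   -- number the values of each set so they can be checked one by one
   SELECT set_id, value,
          row_number() OVER (PARTITION BY set_id ORDER BY value) AS rn,
          count(*)     OVER (PARTITION BY set_id)                AS ct
   FROM   expectation
), rec AS (
   -- users that have the first value of a set
   SELECT a.user_id, r.set_id, r.rn, r.ct
   FROM   ranked r
   JOIN   actual a USING (value)
   WHERE  r.rn = 1
   UNION ALL
   -- keep only users that also have the next value of the same set
   SELECT rec.user_id, rec.set_id, r.rn, r.ct
   FROM   rec
   JOIN   ranked r ON r.set_id = rec.set_id AND r.rn = rec.rn + 1
   JOIN   actual a ON a.user_id = rec.user_id AND a.value = r.value
)
SELECT user_id, array_agg(set_id) AS set_ids
FROM   rec
WHERE  rn = ct        -- the user matched every value of the set
GROUP  BY user_id
LIMIT  10;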