I want to implement efficient pagination in Postgresql (14.5): give me a few rows associated with some a
, that come immediately after some b
. Unfortunately Index Cond
disappears from under the Index Only Scan
as soon as I try to fetch the a
and b
values from the database itself. Is there a way to avoid it?
The last query in this long list is doing what I want, but slowly. Everything else is presented for completeness to showcase different aspects of the problem.
Creating table and data
CREATE TABLE foo (id SERIAL PRIMARY KEY, a INT NOT NULL, b INT NOT NULL);
CREATE UNIQUE INDEX foo_a_b ON foo (a, b);
INSERT INTO foo (a, b)
SELECT *
FROM generate_series(0, 9) a_table
CROSS JOIN generate_series(1, 100000) b_table;
Static starting point, row comparison, lower bound
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (5, 50000) < (a, b)
ORDER BY a, b
LIMIT 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..0.69 rows=3 width=8) (actual time=0.051..0.056 rows=3 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.43..31717.57 rows=367608 width=8) (actual time=0.049..0.052 rows=3 loops=1)
Index Cond: (ROW(a, b) > ROW(5, 50000))
Heap Fetches: 3
Planning Time: 0.171 ms
Execution Time: 0.113 ms
Lightning fast!
Static starting point, reversed row comparison, lower bound
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (a, b) > (5, 50000)
ORDER BY a, b
LIMIT 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.51 rows=3 width=8) (actual time=0.048..0.051 rows=3 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..11350.75 rows=398533 width=8) (actual time=0.046..0.048 rows=3 loops=1)
Index Cond: (ROW(a, b) > ROW(5, 50000))
Heap Fetches: 0
Planning Time: 0.169 ms
Execution Time: 0.089 ms
Still lightning fast!
Static starting point, element, lower bound
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (5 = a AND 50000 < b) OR 5 < a
ORDER BY a, b
LIMIT 3;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.66 rows=3 width=8) (actual time=68.243..68.244 rows=3 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..33480.43 rows=428662 width=8) (actual time=68.241..68.242 rows=3 loops=1)
Filter: (((5 = a) AND (50000 < b)) OR (5 < a))
Rows Removed by Filter: 550000
Heap Fetches: 0
Planning Time: 0.216 ms
Execution Time: 68.277 ms
Filter kills performance. Lesson learned: optimizer in to smart enough, so just stick to row comparisons.
Inline starting point, row comparison, lower bound
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (SELECT ROW(a, b) FROM foo WHERE id = 550000) < (a, b)
ORDER BY a, b
LIMIT 3
;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8.87..9.12 rows=3 width=8) (actual time=137.267..137.268 rows=3 loops=1)
InitPlan 1 (returns $0)
-> Index Scan using foo_pkey on foo foo_1 (cost=0.42..8.44 rows=1 width=32) (actual time=0.021..0.024 rows=1 loops=1)
Index Cond: (id = 550000)
-> Index Only Scan using foo_a_b on foo (cost=0.42..28480.42 rows=333333 width=8) (actual time=137.264..137.265 rows=3 loops=1)
Filter: ($0 < ROW(a, b))
Rows Removed by Filter: 550000
Heap Fetches: 0
Planning Time: 0.219 ms
Execution Time: 137.310 ms
Filter kills performance.
Inline starting point, reversed row comparison, lower bound
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (a, b) > (SELECT ROW(a, b) FROM foo WHERE id = 550000)
ORDER BY a, b
LIMIT 3
ERROR: subquery has too few columns
LINE 3: WHERE (a, b) > (SELECT ROW(a, b) FROM foo WHERE id = 550000)
Failure. A wat moment for me. Could be a bug.
Static starting point, row comparison, lower bound and upper bound
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (5, 50000) < (a, b) AND (a, b) < (440, 35)
ORDER BY a, b
LIMIT 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.52 rows=3 width=8) (actual time=0.055..0.058 rows=3 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..12347.08 rows=398533 width=8) (actual time=0.052..0.054 rows=3 loops=1)
Index Cond: ((ROW(a, b) > ROW(5, 50000)) AND (ROW(a, b) < ROW(440, 35)))
Heap Fetches: 0
Planning Time: 0.224 ms
Execution Time: 0.104 ms
Lightning fast still!
Static starting point, row comparison, fixed first column and lower bounded second column
EXPLAIN (ANALYZE) SELECT a, b
FROM foo
WHERE (5, 50000) < (a, b) AND a = 5
ORDER BY a, b
LIMIT 3;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.52 rows=3 width=8) (actual time=4.579..4.581 rows=3 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..1237.23 rows=39840 width=8) (actual time=4.576..4.578 rows=3 loops=1)
Index Cond: ((ROW(a, b) > ROW(5, 50000)) AND (a = 5))
Heap Fetches: 0
Planning Time: 0.213 ms
Execution Time: 4.622 ms
Much slower, but tolerable. Optimizer is smart enough to combine and equality and the row comparison together and to preserve the Index Cond
.
Starting point from WITH
clause, row comparison, fixed first column and lower bounded second column
EXPLAIN (ANALYZE)
WITH starting_point AS (SELECT ROW(a, b) r FROM foo WHERE id = 550000)
SELECT a, b
FROM foo
WHERE (SELECT r FROM starting_point) < (a, b) AND a = (SELECT a FROM starting_point)
ORDER BY a, b
LIMIT 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8.89..100.63 rows=3 width=8) (actual time=136.362..136.365 rows=3 loops=1)
CTE starting_point
-> Index Scan using foo_pkey on foo foo_1 (cost=0.42..8.44 rows=1 width=32) (actual time=0.025..0.027 rows=1 loops=1)
Index Cond: (id = 550000)
InitPlan 2 (returns $1)
-> CTE Scan on starting_point (cost=0.00..0.02 rows=1 width=32) (actual time=0.029..0.032 rows=1 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..50980.43 rows=1667 width=8) (actual time=136.359..136.361 rows=3 loops=1)
Filter: (($1 < ROW(a, b)) AND (a = (SubPlan 3)))
Rows Removed by Filter: 550000
Heap Fetches: 0
SubPlan 3
-> CTE Scan on starting_point starting_point_1 (cost=0.00..0.02 rows=1 width=4) (actual time=0.000..0.001 rows=1 loops=3)
Planning Time: 0.311 ms
Execution Time: 136.423 ms
This is the query that I want to run faster.
UPD: Slight improvement.
EXPLAIN (ANALYZE)
WITH starting_point AS (SELECT ROW(a, b) r, a FROM foo WHERE id = 550000)
SELECT a, b
FROM foo
WHERE (SELECT r FROM starting_point) < (a, b) AND a = (SELECT a FROM starting_point)
ORDER BY a, b
LIMIT 3;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=8.91..9.19 rows=3 width=8) (actual time=41.532..41.535 rows=3 loops=1)
CTE starting_point
-> Index Scan using foo_pkey on foo foo_1 (cost=0.42..8.44 rows=1 width=36) (actual time=0.036..0.038 rows=1 loops=1)
Index Cond: (id = 550000)
InitPlan 2 (returns $1)
-> CTE Scan on starting_point (cost=0.00..0.02 rows=1 width=32) (actual time=0.001..0.002 rows=1 loops=1)
InitPlan 3 (returns $2)
-> CTE Scan on starting_point starting_point_1 (cost=0.00..0.02 rows=1 width=4) (actual time=0.041..0.044 rows=1 loops=1)
-> Index Only Scan using foo_a_b on foo (cost=0.42..3100.43 rows=33333 width=8) (actual time=41.529..41.530 rows=3 loops=1)
Index Cond: (a = $2)
Filter: ($1 < ROW(a, b))
Rows Removed by Filter: 50000
Heap Fetches: 0
Planning Time: 0.314 ms
Execution Time: 41.598 ms
For some reason this is somewhat faster.
3
Answers
Building on the answer of jjanes, here's what I ended up having:
Learnings:
ROW()
should be avoided, otherwise the planner fails.I can’t explain why it works this way, but just remove the
row()
from the one that fails and it will succeed and use the fast plan:But if you reverse the comparison, then this
row()
-less select construct fails for returning too many columns, go figure. Maybe when theselect
comes after the main table(a,b)
, it then knows what to expect from the select and so analyzes it correctly.If you want to select the rows that match the
a
of some starting row and are greater thanb
of that row, why not use the simplest possible query:The execution plan is as expected