I have a couple of tables in PostgreSQL 14.6, with indexes as stated here:
# \d uprn
Table "public.uprn"
Column | Type | Collation | Nullable | Default
-----------+----------------------+-----------+----------+---------
uprn | text | | |
postcode | text | | |
location | geometry(Point,4326) | | |
Indexes:
"uprn_location" gist (location)
"uprn_postcode" btree (postcode)
"uprn_uprn" btree (uprn)
# \d geometry_subdivided
Table "public.geometry_subdivided"
Column | Type | Collation | Nullable | Default
---------+----------+-----------+----------+---------
area_id | integer | | |
geom | geometry | | |
Indexes:
"subdivided_area_id" btree (area_id)
"subdivided_geom_id" gist (geom)
And this many rows (VACUUM ANALYZE has been done post-import):
# select count(*) from uprn;
count
----------
32872945
(1 row)
# select count(*) from geometry_subdivided;
count
--------
938500
(1 row)
23 of the uprn rows have a postcode of "M19 1TF". If I run a query joining these two tables on that postcode, looking for the areas covering those points, it takes several seconds and uses a sequential scan:
# explain analyze select area_id,count(*) from uprn u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) and postcode='M19 1TF' group by area_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=6287745.55..6292743.36 rows=16428 width=12) (actual time=6273.505..6290.003 rows=16 loops=1)
Group Key: g.area_id
-> Gather Merge (cost=6287745.55..6292414.80 rows=32856 width=12) (actual time=6273.497..6289.990 rows=48 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=6286745.52..6287622.38 rows=16428 width=12) (actual time=6254.442..6254.452 rows=16 loops=3)
Group Key: g.area_id
-> Sort (cost=6286745.52..6286983.05 rows=95010 width=4) (actual time=6254.431..6254.434 rows=69 loops=3)
Sort Key: g.area_id
Sort Method: quicksort Memory: 25kB
Worker 0: Sort Method: quicksort Memory: 33kB
Worker 1: Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=25.29..6278890.20 rows=95010 width=4) (actual time=4756.836..6254.376 rows=69 loops=3)
-> Parallel Seq Scan on uprn u (cost=0.00..1264850.55 rows=101 width=32) (actual time=4725.962..6221.730 rows=8 loops=3)
Filter: (postcode = 'M19 1TF'::text)
Rows Removed by Filter: 10957641
-> Index Scan using subdivided_geom_id on geometry_subdivided g (cost=25.29..49643.02 rows=94 width=2040) (actual time=0.102..0.253 rows=9 loops=23)
Index Cond: (geom ~ st_transform(u.location, 27700))
Filter: st_covers(geom, st_transform(u.location, 27700))
Rows Removed by Filter: 7
Planning Time: 0.359 ms
Execution Time: 6290.100 ms
(22 rows)
But if I discourage PostgreSQL from using sequential scans, the same query takes milliseconds:
# set enable_seqscan to off;
SET
# explain analyze select area_id,count(*) from uprn u, geometry_subdivided g where st_covers(g.geom, st_transform(u.location,27700)) and postcode='M19 1TF' group by area_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=12076657.05..12078536.73 rows=17175 width=12) (actual time=7.710..7.766 rows=16 loops=1)
Group Key: g.area_id
-> Sort (cost=12076657.05..12077226.36 rows=227724 width=4) (actual time=7.681..7.714 rows=207 loops=1)
Sort Key: g.area_id
Sort Method: quicksort Memory: 34kB
-> Nested Loop (cost=31.60..12053278.12 rows=227724 width=4) (actual time=0.203..7.603 rows=207 loops=1)
-> Bitmap Heap Scan on uprn u (cost=6.31..966.53 rows=242 width=32) (actual time=0.035..0.073 rows=23 loops=1)
Recheck Cond: (postcode = 'M19 1TF'::text)
Heap Blocks: exact=10
-> Bitmap Index Scan on uprn_postcode (cost=0.00..6.25 rows=242 width=0) (actual time=0.024..0.025 rows=23 loops=1)
Index Cond: (postcode = 'M19 1TF'::text)
-> Index Scan using subdivided_geom_id on geometry_subdivided g (cost=25.29..49802.00 rows=94 width=2038) (actual time=0.116..0.322 rows=9 loops=23)
Index Cond: (geom ~ st_transform(u.location, 27700))
Filter: st_covers(geom, st_transform(u.location, 27700))
Rows Removed by Filter: 7
Planning Time: 0.259 ms
Execution Time: 7.851 ms
(17 rows)
Why is PostgreSQL not using the index in the normal query, and is there a way I can reword the query to make it do so? I've tried a number of combinations, but without luck so far.
Answers
One of the possibilities I had tried was moving the uprn lookup into a subquery (since select * from uprn where postcode='M19 1TF' on its own used the index), but that made no difference. However, if I add an ORDER BY to that inner query, it then does use the index!
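For reference, the rewritten query looks like this (a sketch; the column used in the ORDER BY is arbitrary, its only purpose is to stop the planner from flattening the subquery away):

```sql
-- The ORDER BY in the subquery discourages the planner from flattening it,
-- so the postcode index is used for the inner lookup.
select g.area_id, count(*)
from (select * from uprn where postcode = 'M19 1TF' order by uprn) u,
     geometry_subdivided g
where st_covers(g.geom, st_transform(u.location, 27700))
group by g.area_id;
```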
This answers my question in terms of making it performant, which was the main concern, though I still don't understand why PostgreSQL does not use the index in the first place.
On a shallow level, it is easy to see what is going on. The planner thinks the geometry index scans are by far the most costly part of the plan, and that the sequential scan is the key to running those index scans in parallel; the higher cost of the sequential scan is believed to be worth it to get the parallelization. So an easy solution, if you generally don't get much benefit from parallelization, is to set max_parallel_workers_per_gather = 0.
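For example, disabling parallelism just for this query (a sketch; SET LOCAL confines the change to the enclosing transaction) removes the planner's incentive to pick the sequential scan:

```sql
-- With parallel workers disabled, the planner has no reason to prefer the
-- (parallelizable) sequential scan over the bitmap scan on uprn_postcode.
begin;
set local max_parallel_workers_per_gather = 0;
select g.area_id, count(*)
from uprn u, geometry_subdivided g
where st_covers(g.geom, st_transform(u.location, 27700))
  and u.postcode = 'M19 1TF'
group by g.area_id;
commit;
```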
On a deeper level, it is harder to see. Why does it think the sequential scan is key to the parallel plan? 'Parallel Bitmap Heap Scan' does exist in this version, so why not use that? I think it is because the expected number of rows from that table is low enough (242) that the bitmap cannot easily be divided up for parallelization, while a sequential scan is easy to divide. (I have not looked into the source code to verify this theory.) If the expected number of rows were higher, it would use the Parallel Bitmap Heap Scan; if it were much lower, it would not consider it important to parallelize in the first place.
So another possible solution is to fix the estimation problem where the planner thinks the postcode will match 242 rows but actually finds only 23. You should be able to do this by increasing default_statistics_target, or by increasing the statistics parameter just for that column, and then re-running ANALYZE. But I'm not sure that getting the estimate down to 23 rows will be enough to make it forgo the parallel plan.
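Raising the per-column statistics target might look like this (a sketch; 1000 is just an illustrative value, the default target is 100):

```sql
-- Collect a larger sample for the postcode column, improving the
-- selectivity estimate for postcode = '...' predicates.
alter table uprn alter column postcode set statistics 1000;
analyze uprn;
```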
The other half of the row-estimation problem is that each iteration of the geometry index scan expects to find 94 rows but really finds only 9. There is nothing simple you can do about this: this type of scan on a geometry index does not look at the data distribution; it just always assumes it will return 1/10000 of the table.
As for your attempt with the subselect, the planner "sees through" this formulation and comes up with the same plan as it otherwise would. Usually this see-through leads to better plans, but here it defeats your attempt to force a better one. To avoid it, you can use a materialized CTE, which blocks such planner see-through.
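A sketch of the materialized-CTE form (AS MATERIALIZED is available from PostgreSQL 12 onward, so it works on 14.6):

```sql
-- MATERIALIZED forces the CTE to be evaluated on its own, so the planner
-- cannot flatten it into the outer query and choose the sequential scan.
with u as materialized (
    select * from uprn where postcode = 'M19 1TF'
)
select g.area_id, count(*)
from u, geometry_subdivided g
where st_covers(g.geom, st_transform(u.location, 27700))
group by g.area_id;
```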