I am not an expert at Postgres/GIS subjects and I have an issue with a large database (over 20 million records) of geometries. First of all my set up looks like this:
mmt=# select version();
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------
version | PostgreSQL 13.2 (Debian 13.2-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
mmt=# select PostGIS_Version();
-[ RECORD 1 ]---+--------------------------------------
postgis_version | 3.1 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
The table that I am querying contains the following columns:
mmt=# d titles
Table "public.titles"
Column | Type | Collation | Nullable | Default
----------------------+--------------------------+-----------+----------+-----------------------------------------
ogc_fid | integer | | not null | nextval('titles_ogc_fid_seq'::regclass)
wkb_geometry | bytea | | |
timestamp | timestamp with time zone | | |
end | timestamp with time zone | | |
gml_id | character varying | | |
validfrom | character varying | | |
beginlifespanversion | character varying | | |
geom_bounding_box | geometry(Geometry,4326) | | |
Indexes:
"titles_pkey" PRIMARY KEY, btree (ogc_fid)
"geom_idx" gist (geom_bounding_box)
The geom_bounding_box column holds the bounding box of the wkb_geometry. I have created that bounding box column because the wkb geometries exceed the default size limits for items in a GIST index. Some of them are quite complex geometries with several dozens of points making up a polygon. Using a bounding box instead meant I was able to put an index on that column as a way of speeding up the search.. at least that’s the theory.
My search aims to find geometries which fall within 100 metres of a given point as follows, however this takes well over two minutes to return. I want to get that under one second!:
select ogc_fid, web_geometry from titles where ST_DWithin(geom_bounding_box, 'SRID=4326;POINT(-0.145872 51.509691)'::geography, 100);
Below is a basic explain output. What can I do to speed this thing up?
Thank you!
mmt=# explain select ogc_fid from titles where ST_DWithin(geom_bounding_box, 'SRID=4326;POINT(-0.145872 51.509691)'::geography, 100);
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Gather (cost=1000.00..243806855.33 rows=2307 width=4)
-[ RECORD 2 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Workers Planned: 2
-[ RECORD 3 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Parallel Seq Scan on titles (cost=0.00..243805624.63 rows=961 width=4)
-[ RECORD 4 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Filter: st_dwithin((geom_bounding_box)::geography, '0101000020E61000006878B306EFABC2BF6308008E3DC14940'::geography, '100'::double precision, true)
-[ RECORD 5 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | JIT:
-[ RECORD 6 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Functions: 4
-[ RECORD 7 ]----------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Options: Inlining true, Optimization true, Expressions true, Deforming true
3
Answers
I did resolve this and it was a combination of all of the above things, although not any one of them alone. As a quick summary:
Laurenz Albe was right in spotting the mix of geography and geometry types, which was easy to fix by removing the cast.
Ian Turton was also right in spotting that dozens of points shouldn't be an issue for a gist index, so I abandoned the bounding box approximation approach and went back to exploring the index issues. What I found was that the geometry column was defined with a data type of 'byte array' (bytea), which prevents creation of an spgist index due to 'no default operator class for access method "spgist"' This was resolved by changing the column type as follows:
The index then creates successfully (either gist or spgist) and I have been able to benchmark the two side by side, finding gist to be slightly more efficient in my use-case.
Amanin was also right to point out the differences between meters and radial degrees according to the spatial reference system. In some of my tests I was erroneously using the latter, but on very large radii. Since I'm indexing and searching with geometry types, that radius value needs to be very small in radial degrees in order to cover quite large areas. Fixed!
All put together, and searches across 26 million records consistently complete in 200ms to 500ms, with occasional spikes up to 1.1s. This is pretty good.
Thanks all who contributed input, ideas and discussion.
The problem is that you are mixing
geometry
andgeography
, and PostgreSQL castsgeom_bounding_box
togeography
so that they match.Now you have indexed
geom_bounding_box
, but notgeom_bounding_box::geography
, which is something different.Either use
'SRID=4326;POINT(-0.145872 51.509691)'::geometry
as second operand or create the GiST index on((geom_bounding_box::geography))
(note the double parentheses).EDIT:
As pointed out by mlinth, my answer below is not really valid. It raises a danger though: beware of the arguments given to the ST_DWithin function, because the unit of distance argument is inferred differently depending if you give geographies (meters) or geometries (srid unit).
According to the ST_DWithin doc, the distance is specified in SRID unit. In your case, the spatial reference system is a geographic one, so your 100 value means 100 degree radius, not 100 meters. That means approximately the entire world. In such case, efficiently using the index is impossible.
If you want to find geometries in a 100 meter radius, you must convert a 100 meter in degree unit, but that depends on latitude (if you want to be accurate).
To start, I’d recommend you to use a (very) approximate shortcut: 100 meters at the equator is (very) approximately equal to 0.001 degrees. So replace your distance value with it, and if it speed up things (and I’m pretty convinced it will), then you will be able to refine your query to be more accurate.