There are two large table num
,pre
in database :
d num
Table "public.num"
Column | Type | Collation | Nullable | Default
----------+-------------------------+-----------+----------+---------------------------------
id | integer | | not null | nextval('num_id_seq'::regclass)
adsh | character varying(20) | | |
tag | character varying(256) | | |
version | character varying(20) | | |
coreg | character varying(256) | | |
ddate | date | | |
qtrs | numeric(8,0) | | |
uom | character varying(20) | | |
value | numeric(28,4) | | |
footnote | character varying(1024) | | |
d pre
Table "public.pre"
Column | Type | Collation | Nullable | Default
----------+------------------------+-----------+----------+---------------------------------
id | integer | | not null | nextval('pre_id_seq'::regclass)
adsh | character varying(20) | | |
report | numeric(6,0) | | |
line | numeric(6,0) | | |
stmt | character varying(2) | | |
inpth | boolean | | |
rfile | character(1) | | |
tag | character varying(256) | | |
version | character varying(20) | | |
plabel | character varying(512) | | |
negating | boolean | | |
Check how many records in the tables:
select count(*) from num;
count
----------
83862587
(1 row)
Time: 204945.436 ms (03:24.945)
select count(*) from pre;
count
----------
36738034
(1 row)
Time: 100604.085 ms (01:40.604)
Execute a long query :
explain analyze select tag,uom,qtrs,value,ddate from num
where adsh='0000320193-22-000108' and tag in
(select tag from pre where stmt='IS' and
adsh='0000320193-22-000108') and ddate='2022-09-30';
It cost almost 7 minutes 30 seconds.
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Semi Join (cost=2000.00..2909871.29 rows=2 width=59) (actual time=357717.922..450523.035 rows=45 loops=1)
Join Filter: ((num.tag)::text = (pre.tag)::text)
Rows Removed by Join Filter: 61320
-> Gather (cost=1000.00..1984125.01 rows=32 width=59) (actual time=190.355..92987.731 rows=678 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on num (cost=0.00..1983121.81 rows=13 width=59) (actual time=348.753..331304.671 rows=226 loops=3)
Filter: (((adsh)::text = '0000320193-22-000108'::text) AND (ddate = '2022-09-30'::date))
Rows Removed by Filter: 27953970
-> Materialize (cost=1000.00..925725.74 rows=43 width=33) (actual time=0.097..527.331 rows=91 loops=678)
-> Gather (cost=1000.00..925725.53 rows=43 width=33) (actual time=65.880..357527.133 rows=96 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on pre (cost=0.00..924721.22 rows=18 width=33) (actual time=201.713..357490.037 rows=32 loops=3)
Filter: (((adsh)::text = '0000320193-22-000108'::text) AND ((stmt)::text = 'IS'::text))
Rows Removed by Filter: 12245979
Planning Time: 0.632 ms
JIT:
Functions: 27
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 5.870 ms, Inlining 272.489 ms, Optimization 367.828 ms, Emission 213.288 ms, Total 859.474 ms
Execution Time: 450524.974 ms
(22 rows)
Time: 450526.084 ms (07:30.526)
How can optimize the database to reduce time on query?Add index and close something (the db running on my local pc without any other users)?
3
Answers
The subselect in the where clause needs to be rewritten into a table join which will be much faster.The DISTINCT may not be necessary, but I don’t know the cardinality of your data, so I included it.
Its performing a full table scan on those columns, as you see from the analysis that says "Parallel Seq Scan" (ie sequential scan) on num and pre.
this means it is looking at every row to check if it should be used.
To speed it up, massively, you need to create an index on the columns used in the where clauses (pre.stmt, pre.adsh, num.adsh and num.ddate). Then the query will use the index to decide which rows to include, and as indexes are specially organised for this task, the performance will increase.
adding a CTE would greatly improve your query :
Also you can specialize one of the used columns if the queries used most frequently are very known :
This would create a very fast index for just a small portion of the table.
Is important to notice that a indexes have very limited usage for non-preplanned selects, for large tables is often better to create a index scan select and re-query the results from a CTE than affording to load the entire table, as example this is a very common predicate of every day db usage and its problems:
Now notice how important is for your queries to be matching the correct indexing
This means if you change the column being researched your indexes should match or they doenst get used, this is same for integercolumn::text = ‘x’ or date::text = ‘2019-10-01’
after adding the proper index this get solved:
as much tables start getting bigger less and less random filters can be allowed, table scan will toggle a S.O memory cache , than later the same data will be doubled in the postgresql cache and only later will stabilize .
Unstable postgresql cached queries will reduce the speed of the previously cached queries whenever a new random query is executed if the cached memory of the server is already on the limit.