Let’s say I have the following PostgreSQL database schema:
Group
  id: int

Task
  id: int
  created_at: datetime
  group: FK Group
I have the following Materialized View to calculate the number of tasks and the most recent Task.created_at
value per group:
CREATE MATERIALIZED VIEW group_statistics AS (
    SELECT
        group.id AS group_id,
        MAX(task.created_at) AS latest_task_created_at,
        COUNT(task.id) AS task_count
    FROM group
    LEFT OUTER JOIN task ON (group.id = task.group_id)
    GROUP BY group.id
);
The Task table currently has 20 million records, so refreshing this materialized view takes a long time (20-30 seconds). We’ve also been experiencing some short but major DB performance issues ever since we started refreshing the materialized view every 10 min, even with CONCURRENTLY:
REFRESH MATERIALIZED VIEW CONCURRENTLY group_statistics;
Is there a more performant way to calculate these values? Note, they do NOT need to be exact. Approximate values are totally fine, e.g. latest_task_created_at can be 10-20 min delayed.
I’m thinking of caching these values on every write to the Task table, either in Redis or in PostgreSQL itself.
Update
People are requesting the execution plan. EXPLAIN doesn’t work on REFRESH, but I ran EXPLAIN on the actual query. Note, it’s different than my theoretical data model above. In this case, Database is Group and Record is Task. Also note, I’m on PostgreSQL 12.10.
EXPLAIN (analyze, buffers, verbose)
SELECT
    store_database.id AS database_id,
    MAX(store_record.updated_at) AS latest_record_updated_at,
    COUNT(store_record.id) AS record_count
FROM store_database
LEFT JOIN store_record ON (store_database.id = store_record.database_id)
GROUP BY store_database.id;
Output:
HashAggregate  (cost=1903868.71..1903869.22 rows=169 width=32) (actual time=18227.016..18227.042 rows=169 loops=1)
  Output: store_database.id, max(store_record.updated_at), count(store_record.id)
  Group Key: store_database.id
  Buffers: shared hit=609211 read=1190704
  I/O Timings: read=3385.027
  ->  Hash Right Join  (cost=41.28..1872948.10 rows=20613744 width=40) (actual time=169.766..14572.558 rows=20928339 loops=1)
        Output: store_database.id, store_record.updated_at, store_record.id
        Inner Unique: true
        Hash Cond: (store_record.database_id = store_database.id)
        Buffers: shared hit=609211 read=1190704
        I/O Timings: read=3385.027
        ->  Seq Scan on public.store_record  (cost=0.00..1861691.23 rows=20613744 width=40) (actual time=0.007..8607.425 rows=20928316 loops=1)
              Output: store_record.id, store_record.key, store_record.data, store_record.created_at, store_record.updated_at, store_record.database_id, store_record.organization_id, store_record.user_id
              Buffers: shared hit=609146 read=1190704
              I/O Timings: read=3385.027
        ->  Hash  (cost=40.69..40.69 rows=169 width=16) (actual time=169.748..169.748 rows=169 loops=1)
              Output: store_database.id
              Buckets: 1024  Batches: 1  Memory Usage: 16kB
              Buffers: shared hit=65
              ->  Index Only Scan using store_database_pkey on public.store_database  (cost=0.05..40.69 rows=169 width=16) (actual time=0.012..0.124 rows=169 loops=1)
                    Output: store_database.id
                    Heap Fetches: 78
                    Buffers: shared hit=65
Planning Time: 0.418 ms
JIT:
  Functions: 14
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 2.465 ms, Inlining 15.728 ms, Optimization 92.852 ms, Emission 60.694 ms, Total 171.738 ms
Execution Time: 18229.600 ms
Note the large execution time. It sometimes takes 5-10 minutes to run. I would love to bring this down to a consistent few seconds at most.
Update #2
People are requesting the execution plan when the query takes minutes. Here it is:
HashAggregate  (cost=1905790.10..1905790.61 rows=169 width=32) (actual time=128442.799..128442.825 rows=169 loops=1)
  Output: store_database.id, max(store_record.updated_at), count(store_record.id)
  Group Key: store_database.id
  Buffers: shared hit=114011 read=1685876 dirtied=367
  I/O Timings: read=112953.619
  ->  Hash Right Join  (cost=15.32..1874290.39 rows=20999810 width=40) (actual time=323.497..124809.521 rows=21448762 loops=1)
        Output: store_database.id, store_record.updated_at, store_record.id
        Inner Unique: true
        Hash Cond: (store_record.database_id = store_database.id)
        Buffers: shared hit=114011 read=1685876 dirtied=367
        I/O Timings: read=112953.619
        ->  Seq Scan on public.store_record  (cost=0.00..1862849.43 rows=20999810 width=40) (actual time=0.649..119522.406 rows=21448739 loops=1)
              Output: store_record.id, store_record.key, store_record.data, store_record.created_at, store_record.updated_at, store_record.database_id, store_record.organization_id, store_record.user_id
              Buffers: shared hit=113974 read=1685876 dirtied=367
              I/O Timings: read=112953.619
        ->  Hash  (cost=14.73..14.73 rows=169 width=16) (actual time=322.823..322.824 rows=169 loops=1)
              Output: store_database.id
              Buckets: 1024  Batches: 1  Memory Usage: 16kB
              Buffers: shared hit=37
              ->  Index Only Scan using store_database_pkey on public.store_database  (cost=0.05..14.73 rows=169 width=16) (actual time=0.032..0.220 rows=169 loops=1)
                    Output: store_database.id
                    Heap Fetches: 41
                    Buffers: shared hit=37
Planning Time: 5.390 ms
JIT:
  Functions: 14
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.306 ms, Inlining 82.966 ms, Optimization 176.787 ms, Emission 62.561 ms, Total 323.620 ms
Execution Time: 128474.490 ms
5 Answers
Your MV currently has 169 rows, so write costs are negligible (unless you have locking issues). It’s all about the expensive sequential scan over the big table.
Full counts are slow
Getting exact counts per group ("database") is expensive. There is no magic bullet for that in Postgres. Postgres has to count all rows. If the table is all-visible (visibility map is up to date), Postgres can shorten the procedure somewhat by only traversing a covering index. (You did not provide indexes …)
There are possible shortcuts with an estimate for the total row count in the whole table. But the same is not easily available per group. See:
But not that slow
That said, your query can still be substantially faster. Aggregate before the join:
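A sketch of that rewrite, using the table and column names from the EXPLAIN output in the question (the exact query from the original answer was not preserved, so treat this as an illustration of the technique):

-- Aggregate the big table once, then join the 169-row store_database to the result.
SELECT d.id AS database_id,
       r.latest_record_updated_at,
       COALESCE(r.record_count, 0) AS record_count
FROM   store_database d
LEFT   JOIN (
   SELECT database_id,
          MAX(updated_at) AS latest_record_updated_at,
          COUNT(*)        AS record_count
   FROM   store_record
   GROUP  BY database_id
   ) r ON r.database_id = d.id;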
See:
And use the slightly faster (and equivalent in this case) count(*). Related:
Also – visibility provided – count(*) can use any non-partial index, preferably the smallest, while count(store_record.id) is limited to an index on that column (and has to inspect values, too).
I/O is your bottleneck
You added the EXPLAIN plan for an expensive execution, and the skyrocketing I/O cost stands out. It dominates the cost of your query.
Fast plan:
    Buffers: shared hit=609211 read=1190704
    I/O Timings: read=3385.027
Slow plan:
    Buffers: shared hit=114011 read=1685876 dirtied=367
    I/O Timings: read=112953.619
Your Seq Scan on public.store_record spent 112953.619 ms on reading data file blocks. 367 dirtied buffers represent under 3 MB and are only a tiny fraction of total I/O. Either way, I/O dominates the cost.
Either your storage system is abysmally slow or, more likely since I/O of the fast query costs 30x less, there is too much contention for I/O from concurrent workload (on an inappropriately configured system). One or more of these can help:
Hugely faster without count
If there was no count, just latest_record_updated_at, a query for that alone would deliver it in close to no time, in combination with a matching index. Ideally, both would look something like the sketch below:
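A possible shape for both, using the names from the EXPLAIN output; the original answer’s exact query and index were not preserved, so treat this as an illustrative sketch:

-- Latest updated_at per database, no count: one cheap index lookup per database row.
SELECT d.id AS database_id, r.latest_record_updated_at
FROM   store_database d
LEFT   JOIN LATERAL (
   SELECT updated_at AS latest_record_updated_at
   FROM   store_record
   WHERE  database_id = d.id
   ORDER  BY updated_at DESC NULLS LAST
   LIMIT  1
   ) r ON true;

-- Matching index; including id also lets the full count query scan only the index.
CREATE INDEX store_record_database_id_updated_at_id_idx
    ON store_record (database_id, updated_at DESC NULLS LAST, id);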
See:
The same index can also help the complete query above, even if not as dramatically. If the table is vacuumed enough (visibility map up to date), Postgres can do a sequential scan on the smaller index without involving the bigger table. This obviously matters more for wider table rows – especially easing your I/O problem.
(Of course, index maintenance adds costs, too …)
Upgrade to use parallelism
Upgrade to the latest version of Postgres if at all possible. Postgres 14 and 15 have received various performance improvements over Postgres 12. Most importantly, quoting the release notes for Postgres 14: "Allow REFRESH MATERIALIZED VIEW to use parallelism."
Could be massive for your use case. Related:
Estimates?
Warning: experimental stuff.
You stated: "they do NOT need to be exact. Approximate values are totally fine."
I see only 169 groups ("databases") in the query plan. Postgres maintains column statistics. While the distinct count of groups is tiny and stays below the "statistics target" for the column store_record.database_id (which you have to make sure of!), we can work with this. See:
Unless you have very aggressive autovacuum settings, to get better estimates, run ANALYZE on database_id to update column statistics before running the query below. (Also updates reltuples and relpages in pg_class.):
Or even (to also update the visibility map for the above query):
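For example (a minimal sketch; the original answer’s exact commands were not preserved):

-- Update column statistics for database_id (and reltuples / relpages in pg_class):
ANALYZE store_record (database_id);

-- Or even, to also update the visibility map:
VACUUM ANALYZE store_record;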
This was the most expensive part (with collateral benefits). And it’s optional.
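The estimate query itself could look something like this sketch. It assumes database_id is a uuid (the plans show width=16) and that the statistics target is high enough for every database_id to land in the most-common-values list; the original answer’s exact query was not preserved:

SELECT d.id AS database_id,
       COALESCE((ct.total_est * f.freq)::bigint, 0) AS estimated_record_count
FROM   store_database d
LEFT   JOIN (
   -- per-value frequencies from column statistics
   SELECT unnest(most_common_vals::text::uuid[]) AS database_id,
          unnest(most_common_freqs)              AS freq
   FROM   pg_stats
   WHERE  schemaname = 'public'
   AND    tablename  = 'store_record'
   AND    attname    = 'database_id'
   ) f ON f.database_id = d.id
CROSS  JOIN (
   -- estimated total row count from pg_class
   -- (assumes a non-empty table that has been analyzed at least once)
   SELECT reltuples / relpages
          * (pg_relation_size('public.store_record'::regclass)
             / current_setting('block_size')::int) AS total_est
   FROM   pg_class
   WHERE  oid = 'public.store_record'::regclass
   ) ct;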
The query relies on various Postgres internals and may break in future major versions (though unlikely). Tested with Postgres 14, but it works with Postgres 12, too. It’s basically black magic. You need to know what you are doing. You have been warned.
But the query costs close to nothing.
Take exact values for latest_record_updated_at from the fast query above, and join to these estimates for the count.
Basic explanation: Postgres maintains column statistics in the system catalog pg_statistic. pg_stats is a view on it, easier to access. Among other things, "most common values" and their relative frequency are gathered, represented in most_common_vals and most_common_freqs. Multiplied with the current (estimated) total count, we get estimates per group. You could do all of it manually, but Postgres is probably much faster and better at this.
For the computation of the total estimate ct.total_est see:
(Note the "Safe and explicit" form for this query.)
Given the explain plan, the sequential scan seems to be causing the slowness. An index can definitely help there.
You can also utilize index-only scans, as there are few columns in the query. So you can use something like the sketch below for the store_record table.
An index on the id column of the store_database table is also needed.
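The original index definitions were not preserved; a sketch of what they might look like (store_database.id is usually already covered by its primary key, as the plan’s index-only scan suggests):

-- Lets the aggregate read (database_id, updated_at, id) from the index alone.
CREATE INDEX store_record_database_id_updated_at_id_idx
    ON store_record (database_id, updated_at, id);

-- Only needed if store_database.id is not already the primary key:
CREATE UNIQUE INDEX store_database_id_idx ON store_database (id);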
Parallel plan
If you first collect the store_record statistics and then join that with store_database, you’ll get a better, parallelisable plan (sketched below).
Here’s a demo – at the end you can see both queries return the exact same results, but the one I suggest runs faster and has a more flexible plan. The number of workers dispatched depends on your max_worker_processes, max_parallel_workers and max_parallel_workers_per_gather settings, as well as some additional logic inside the planner.
With more rows in store_record the difference will be more pronounced. On my system with 40 million test rows it went down from 14 seconds to 3 seconds with one worker, and to 1.4 seconds when it caps out dispatching six workers out of 16 available.
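The exact query from the demo was not preserved here; its shape is essentially "pre-aggregate store_record, then join store_database", much like the sketch in the answer above. A hypothetical test session (the setting names are real, the value is just an example):

-- Allow more workers for this session's Gather nodes, within the
-- max_worker_processes / max_parallel_workers limits.
SET max_parallel_workers_per_gather = 6;

EXPLAIN (ANALYZE, BUFFERS)
WITH stats AS (                      -- inlined, and parallelisable, on PostgreSQL 12+
   SELECT database_id,
          MAX(updated_at) AS latest_record_updated_at,
          COUNT(*)        AS record_count
   FROM   store_record
   GROUP  BY database_id
)
SELECT d.id AS database_id,
       s.latest_record_updated_at,
       COALESCE(s.record_count, 0) AS record_count
FROM   store_database d
LEFT   JOIN stats s ON s.database_id = d.id;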
Caching
If it’s an option, it’s worth a try – you can maintain proper accuracy and instantly available statistics at the cost of some (deferrable) table throughput overhead. You can replace your materialized view with a regular table, or add the statistics columns to store_database.
And leave keeping the table up to date to a trigger that fires each time store_record changes (a sketch follows below).
Making the trigger deferrable decouples its execution time from the main operation, but it’ll still incur its costs at the end of the transaction. A TRUNCATE trigger cannot be declared FOR EACH ROW together with the rest of the events, so it has to be defined separately.
In my test, an update or delete of 10000 random rows in a 100-million-row table ran in seconds. A single insert of 1000 new, randomly generated rows took 25 ms without the trigger and 200 ms with it. A million rows took 30 s and 3 minutes respectively.
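A sketch of one way to wire this up, with statistics columns added to store_database and a deferrable constraint trigger. The names, types and simplifications are mine, not the answerer’s (it ignores rows changing their database_id, does not recompute the latest timestamp on DELETE, and leaves the TRUNCATE trigger out):

ALTER TABLE store_database
    ADD COLUMN record_count             bigint NOT NULL DEFAULT 0,
    ADD COLUMN latest_record_updated_at timestamptz;

CREATE OR REPLACE FUNCTION store_record_statistics()
RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
   IF TG_OP = 'INSERT' THEN
      UPDATE store_database
      SET    record_count = record_count + 1,
             latest_record_updated_at = GREATEST(latest_record_updated_at, NEW.updated_at)
      WHERE  id = NEW.database_id;
   ELSIF TG_OP = 'UPDATE' THEN
      UPDATE store_database
      SET    latest_record_updated_at = GREATEST(latest_record_updated_at, NEW.updated_at)
      WHERE  id = NEW.database_id;
   ELSIF TG_OP = 'DELETE' THEN
      UPDATE store_database
      SET    record_count = record_count - 1
      WHERE  id = OLD.database_id;
   END IF;
   RETURN NULL;  -- return value of an AFTER ROW trigger is ignored
END
$$;

CREATE CONSTRAINT TRIGGER store_record_statistics_trg
AFTER INSERT OR UPDATE OR DELETE ON store_record
DEFERRABLE INITIALLY DEFERRED
FOR EACH ROW EXECUTE FUNCTION store_record_statistics();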
A demo.
Partitioning-backed parallel plan
store_record might be a good fit for partitioning (sketched at the end of this section):
Splitting objects in this manner lets the planner split their scans accordingly, which works best with parallel workers, but doesn’t require them. Even your initial, unaltered query behind the view is able to take advantage of this structure.
Demo. It’s best to test what granularity gives the best performance on your setup. You can even test sub-partitioning, giving each store_record.database_id a partition that is then sub-partitioned into date ranges, simplifying access to the most recent entries.
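One possible layout, hash-partitioning on database_id; the column types are assumptions and the demo’s exact scheme (for example, LIST partitions per database_id with date-range sub-partitions) was not preserved:

CREATE TABLE store_record_partitioned (
    id          uuid        NOT NULL,
    database_id uuid        NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now(),
    updated_at  timestamptz NOT NULL DEFAULT now()
    -- remaining columns (key, data, organization_id, user_id) omitted for brevity
) PARTITION BY HASH (database_id);

CREATE TABLE store_record_p0 PARTITION OF store_record_partitioned
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE store_record_p1 PARTITION OF store_record_partitioned
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE store_record_p2 PARTITION OF store_record_partitioned
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE store_record_p3 PARTITION OF store_record_partitioned
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);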
Sometimes in such cases it is necessary to think of completely different business logic solutions.
For example, the count operation is a very slow query, and it cannot be accelerated by any means inside the DB itself. What can be done in such cases? Since I do not know your business logic in full detail, I will describe several options. However, these options also have disadvantages. For example:
We group it once and insert the numbers into a table.
Then, whenever a record is inserted into the main table, we update the group table using triggers, like this:
Or like this:
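A sketch of such a setup, using the Group/Task model from the question; the table and function names are illustrative, not from the original answer:

-- Seed the counts table once from existing data.
CREATE TABLE group_task_count AS
SELECT group_id, COUNT(*) AS task_count
FROM   task
GROUP  BY group_id;

ALTER TABLE group_task_count ADD PRIMARY KEY (group_id);

-- Keep it current on every insert; an UPSERT covers groups that appear later.
CREATE OR REPLACE FUNCTION bump_group_task_count()
RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
   INSERT INTO group_task_count (group_id, task_count)
   VALUES (NEW.group_id, 1)
   ON CONFLICT (group_id) DO UPDATE
   SET    task_count = group_task_count.task_count + 1;
   RETURN NEW;
END
$$;

CREATE TRIGGER task_count_trg
AFTER INSERT ON task
FOR EACH ROW EXECUTE FUNCTION bump_group_task_count();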
I am not going into small details here. For updating the row properly, we can use the FOR UPDATE clause; FOR UPDATE locks the row against other transactions.
So, the main solution is this: functions like count need to be executed separately on grouped data, not on the entire table at once. Similar solutions can be applied; I explained it for general understanding.
The disadvantage of this solution is that if you have many insert operations on the main table, insert performance will decrease.
MATERIALIZED VIEW is not a good idea for that …
If you just want to "calculate the number of tasks and the most recent Task.created_at value per group", then I suggest you simply:
Add two columns to the group table.
Update these two columns from trigger functions defined on the table task.
You will need to perform a setup action at the beginning (see the sketch below) …
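A sketch of the columns and the one-time setup, assuming the Group/Task model from the question (group is a reserved word, hence the quoting; the trigger functions themselves would resemble the sketches in the answers above):

ALTER TABLE "group"
    ADD COLUMN task_count             bigint NOT NULL DEFAULT 0,
    ADD COLUMN latest_task_created_at timestamp;

-- Setup action: backfill both columns from the existing task rows.
UPDATE "group" g
SET    task_count             = s.task_count,
       latest_task_created_at = s.latest_task_created_at
FROM  (
   SELECT group_id,
          COUNT(*)        AS task_count,
          MAX(created_at) AS latest_task_created_at
   FROM   task
   GROUP  BY group_id
   ) s
WHERE  s.group_id = g.id;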
… but then you will have no more performance issues with the queries !!!
and you will optimize the size of your database …