PostgreSQL v12. Database cluster size: 12+ Tb (30+ databases 100-600Gb each). Server: 40 cores, 128Gb ram.
For some reason, autovacuum stopped running. How to understand the reason?
Yesterday, the workers were running every minute for each base (according to the settings), but then they stopped and are no longer running.
Settings:
autovacuum = on
autovacuum_max_workers = 4
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50
autovacuum_analyze_threshold = 50
autovacuum_vacuum_scale_factor = 0.10
autovacuum_analyze_scale_factor = 0.05
autovacuum_vacuum_cost_delay = 0
autovacuum_vacuum_cost_limit = 1000
For example, here is a table in one of the databases:
relname: ods_storage_sync_queue
n_live_tup: 8
n_dead_tup: 180542
autovacuum_count: 74167
last_autovacuum: 2023-09-26 18:48:11.868
last_autoanalyze: 2023-09-26 18:48:11.869
The table contains an infinite number of dead tuples, but autovacuum does not clean it.
Another example:
relname: ods_storage_sync_queue
n_live_tup: 11434
n_dead_tup: 2064
autovacuum_count: 16727
last_autovacuum: 2023-09-26 18:13:21.612
last_autoanalyze: 2023-09-26 18:47:11.530
There are 18% dead tupples. According to autovacuum_vacuum_scale_factor = 10%, autovacuum should have started a long time ago, but a day has passed… nothing happens, although yesterday everything started once a minute, as set in autovacuum_naptime.
What else can I check? How to understand the reason?
2
Answers
The described problem was solved by updating PostgreSQL from 12.1 to 12.16. We found in Release Notes that somewhere along the way from 12.1 to 12.16 an error was fixed that caused autovacuum to be stuck.
The one possibility is that autovacuum really does not run. You should set
log_autovacuum_min_duration = 0
and examine your log file to see if that is the case. Reasons for autovacuum not starting areThe statistics collector process has a problem. That is not the case here, since you see the number of dead tuples increasing.
All 4 slots of
autovacuum_max_workers
are already taken by long-running autovacuum workers, and other tables that needVACUUM
have to wait until one of them finishes. You can diagnose that withpg_stat_activity
, and the solution would be to increaseautovacuum_max_workers
.However, it is likely that autovacuum does run, but is prevented from cleaning up dead tuples. As detailed in this article, there are four possible reasons:
a concurrent long running transaction
a stale replication slot
an orphaned prepared transaction
a lagging standby with
hot_standby_feedback = on