Most of the time, our PostgreSQL (version 13.8, AWS Aurora, PostgreSQL compatible) database’s buffer cache hit ratio has been somewhere around 95~98%. However, we have been observing an occasional dip in the buffer cache hit ratio to 70%-80%. I managed to track the occurrences of these dips to vacuum activity. Whenever there’s vacuum activity, we see the buffer cache hit ratio dip to 70-80%.
I can’t find a direct statement in any documentation that says "yes, vacuum activity evicts the buffer and so it causes the buffer cache hit ratio to drop significantly". The best I’ve managed is to find an explanation for how vacuuming works, and in the pseudocode it says that "it will scan all pages to get dead tuples" which I assume will cause these pages to load to the buffer cache and fill it with maybe not so relevant data, ultimately causing the hit ratio to suffer.
Am I on the right path?
What are references on this relation between vacuum activity and the drop in buffer cache hit ratio?
Vacuum documentation in official PostgreSQL documentation has no mention of its impact on the buffer cache.
2
Answers
The cache hit ratio drops when the database reads lots of pages that are not cached. That’s exactly what
VACUUM
does, so what you observe is not very surprising. Don’t worry.People tend to overestimate the relevance of the cache hit ratio. There is nothing evil about doing some I/O when necessary.
VACUUM will use a ring buffer access strategy which prevents it from evicting very many other pages, as it preferentially evicts its own recent pages. But the VACUUM may itself need to visit a large number of cold pages, which will of course need to be read. So the VACUUM can cause the global hit rate to drop not because it causes other processes to have pages misses, but because the vacuum itself has page misses.