I have a postgres table that looks like this:
CREATE UNLOGGED TABLE IF NOT EXISTS batch_update_data (
    id TEXT DEFAULT gen_random_uuid()::TEXT
    , account_id TEXT
    , processed boolean
    ...
);
it has an index that looks like this:
CREATE INDEX batch_update_data__processed_false_idx ON batch_update_data(processed) WHERE processed = FALSE;
whenever I batch load data into that table, I run a query like this right after:
SELECT
    id,
    ...
FROM
    batch_update_data
WHERE
    processed = false
LIMIT
    10
but this doesn’t hit an index until I run this:
VACUUM ANALYZE public.batch_update_data;
I can see vacuum running during the batch load in the logs, so why doesn't it hit an index until I run it manually?
my current auto vacuum settings are the following:
"autovacuum" "on"
"autovacuum_analyze_scale_factor" "0.1"
"autovacuum_analyze_threshold" "50"
"autovacuum_freeze_max_age" "200000000"
"autovacuum_max_workers" "3"
"autovacuum_multixact_freeze_max_age" "400000000"
"autovacuum_naptime" "60"
"autovacuum_vacuum_cost_delay" "2"
"autovacuum_vacuum_cost_limit" "-1"
"autovacuum_vacuum_scale_factor" "0.2"
"autovacuum_vacuum_threshold" "50"
"autovacuum_work_mem" "-1"
"log_autovacuum_min_duration" "0"
----- EDIT -----
here is the EXPLAIN (ANALYZE, VERBOSE, BUFFERS) output for the query. I'm using pg11, so I don't think those settings are an option?
"Limit (cost=0.00..2.49 rows=10 width=100) (actual time=3.050..3.056 rows=10 loops=1)"
" Output: id, account_id, customer_id, account_open_date, account_close_date, sec_pool_id, product_code, latest_card_last_4, card_art, card_bank, expiration_date"
" Buffers: shared hit=1 read=11"
" -> Seq Scan on public.batch_update_data (cost=0.00..864045.52 rows=3474519 width=100) (actual time=3.048..3.051 rows=10 loops=1)"
" Output: id, account_id, customer_id, account_open_date, account_close_date, sec_pool_id, product_code, latest_card_last_4, card_art, card_bank, expiration_date"
" Filter: (NOT batch_update_data.processed)"
" Rows Removed by Filter: 605"
" Buffers: shared hit=1 read=11"
"Planning Time: 3.282 ms"
"Execution Time: 3.095 ms"
2 Answers
It seems like you don't run the bulk load in a single transaction, so autovacuum or autoanalyze can kick in midway through loading the data. That means there are two possibilities (the query after this list shows how to tell which one applies):
- the part of the bulk load that ran after the last autoanalyze modified less than about 10% of the table (with autovacuum_analyze_scale_factor = 0.1 and autovacuum_analyze_threshold = 50, autoanalyze only fires after roughly 0.1 * reltuples + 50 row modifications), so autoanalyze didn't trigger again and the statistics for the most recently loaded values are bad
- you didn't wait long enough after the bulk load to give autoanalyze time to finish
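You can check which of the two it was from the cumulative statistics view. A minimal sketch (pg_stat_user_tables with its last_autoanalyze and n_mod_since_analyze columns is available on PG 11):

-- check when the table was last (auto)analyzed and how many rows
-- have been modified since that analyze
SELECT relname,
       last_analyze,
       last_autoanalyze,
       n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'batch_update_data';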
Both problems would be fixed by running an explicit

ANALYZE

on the table right after the bulk load is done. That is the best practice, particularly if you intend to run queries right after the bulk load.

As it stands it's not lightning fast, but 3 milliseconds isn't slow either. Run ANALYZE after the batch import. Also, a LIMIT without an ORDER BY is usually not a good idea. Create an index on the ORDER BY column (id or account_id) and use the processed column in the WHERE condition:
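Something along these lines, as a sketch; account_id as the sort column is an assumption, and the index name is made up:

-- partial index: covers only unprocessed rows, ordered by account_id,
-- so a query ordering by account_id can read its first rows straight off the index
CREATE INDEX batch_update_data__unprocessed_account_id_idx
    ON batch_update_data (account_id)
    WHERE processed = FALSE;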
And your new query:
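A sketch under the same assumption (account_id as the sort column; the column list is abbreviated, as in the question):

SELECT
    id,
    account_id
FROM
    batch_update_data
WHERE
    processed = false
ORDER BY
    account_id
LIMIT
    10;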
And of course you should use a uuid for the id column, not text:
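For example, a sketch showing only the columns from the question (on PG 11, gen_random_uuid() comes from the pgcrypto extension, which you are evidently already using):

CREATE UNLOGGED TABLE IF NOT EXISTS batch_update_data (
    id uuid DEFAULT gen_random_uuid()
    , account_id text
    , processed boolean
    -- ... other columns as in the question
);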