
I have a postgres table that looks like this:

CREATE UNLOGGED TABLE IF NOT EXISTS batch_update_data (
 id TEXT DEFAULT gen_random_uuid()::TEXT
, account_id TEXT
, processed boolean
...
);

it has an index that looks like this:
CREATE INDEX batch_update_data__processed_false_idx ON batch_update_data(processed) WHERE processed = FALSE;

Whenever I batch-load data into that table, I run a query like this right after:

SELECT
 id,
...
FROM
  batch_update_data
WHERE
  processed = false
LIMIT
  10

but this doesn’t hit an index until I run this:
VACUUM ANALYZE public.batch_update_data;

I can see vacuum running during the batch load in the logs, so why doesn't the query hit the index until I manually run VACUUM ANALYZE?
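
For reference, a check like this against pg_stat_user_tables should show when autovacuum/autoanalyze last touched the table and how many rows have been modified since the last analyze:

SELECT relname, last_autovacuum, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'batch_update_data';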

my current auto vacuum settings are the following:

"autovacuum"    "on"
"autovacuum_analyze_scale_factor"   "0.1"
"autovacuum_analyze_threshold"  "50"
"autovacuum_freeze_max_age" "200000000"
"autovacuum_max_workers"    "3"
"autovacuum_multixact_freeze_max_age"   "400000000"
"autovacuum_naptime"    "60"
"autovacuum_vacuum_cost_delay"  "2"
"autovacuum_vacuum_cost_limit"  "-1"
"autovacuum_vacuum_scale_factor"    "0.2"
"autovacuum_vacuum_threshold"   "50"
"autovacuum_work_mem"   "-1"
"log_autovacuum_min_duration"   "0"

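As far as I know these can also be overridden per table, along the lines of the example below (the values here are just placeholders), but the values above are what the cluster is using:

ALTER TABLE batch_update_data SET (
  autovacuum_analyze_scale_factor = 0.01
, autovacuum_analyze_threshold = 1000
);
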
----- EDIT -----

Here is the EXPLAIN (ANALYZE, VERBOSE, BUFFERS) output for the query. I'm using PostgreSQL 11, so I don't think that setting is an option?

"Limit  (cost=0.00..2.49 rows=10 width=100) (actual time=3.050..3.056 rows=10 loops=1)"
"  Output: id, account_id, customer_id, account_open_date, account_close_date, sec_pool_id, product_code, latest_card_last_4, card_art, card_bank, expiration_date"
"  Buffers: shared hit=1 read=11"
"  ->  Seq Scan on public.batch_update_data  (cost=0.00..864045.52 rows=3474519 width=100) (actual time=3.048..3.051 rows=10 loops=1)"
"        Output: id, account_id, customer_id, account_open_date, account_close_date, sec_pool_id, product_code, latest_card_last_4, card_art, card_bank, expiration_date"
"        Filter: (NOT batch_update_data.processed)"
"        Rows Removed by Filter: 605"
"        Buffers: shared hit=1 read=11"
"Planning Time: 3.282 ms"
"Execution Time: 3.095 ms"

2 Answers


  1. It seems like you don't run the bulk load in a single transaction, so autovacuum or autoanalyze can kick in in the middle of loading the data. That means there are two possibilities:

    • the part of the bulk load that ran after the last autoanalyze modified less than 10% of your table, so autoanalyze didn’t trigger, and the statistics for the last loaded values are bad

    • you didn’t wait long enough after the bulk load to give autoanalyze time to finish

    Both problems would be fixed by running an explicit ANALYZE on the table right after the bulk load is done. That is best practice anyway, particularly if you intend to query the table immediately after the load.
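
    A minimal sketch of that pattern (the COPY line and its file path are only placeholders for whatever the bulk load actually does; the ANALYZE at the end is the point):

    -- bulk load (placeholder)
    COPY batch_update_data (account_id, processed) FROM '/path/to/batch.csv' WITH (FORMAT csv);

    -- refresh planner statistics before querying the freshly loaded rows
    ANALYZE batch_update_data;

    Roughly speaking, autoanalyze only fires once the number of modified rows exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples, which with your settings and roughly 3.5 million rows is on the order of 350,000 rows, so an explicit ANALYZE is the simplest way to guarantee fresh statistics immediately after the load.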

  2. It's not lightning fast, but 3 milliseconds isn't slow either. Run ANALYZE after the batch import. A LIMIT without an ORDER BY is usually not a good idea. Create an index on the ORDER BY column (id or account_id) and use the processed column in the WHERE condition:

    CREATE INDEX batch_update_data_processed_false_idx ON batch_update_data(id) WHERE processed = FALSE;
    

    And your new query:

    SELECT
     id,
    ...
    FROM
      batch_update_data
    WHERE
      processed = false
    ORDER BY id ASC -- sorting on id, that's in the index
    LIMIT
      10;
    

    And of course, use a UUID type for the id column, not text:

    ALTER TABLE batch_update_data
        ALTER COLUMN id SET DEFAULT (gen_random_uuid());
    
    ALTER TABLE batch_update_data
        ALTER COLUMN id TYPE uuid USING(CAST(id AS UUID));
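
    If needed, the result can be checked with something along these lines (information_schema is just one of several ways):

    SELECT data_type, column_default
    FROM information_schema.columns
    WHERE table_name = 'batch_update_data' AND column_name = 'id';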
    