I have a table race_racer that contains time-serialized logs for each racer in a race. I am trying to select the latest record for a given racer in x number of races prior to a date, basically selecting their finalized stats. The table has approximately 64 million records. I'm using a framework that doesn't support window functions, and DB functions are not an option. I cannot select based on a list of race_ids because a racer might not be in a given race.
Postgres 15
Table DDL
CREATE TABLE race_racer (
race_racer_id SERIAL PRIMARY KEY,
log_id integer NOT NULL,
racer_id integer NOT NULL,
race_id integer NOT NULL,
stats jsonb,
created_at timestamp without time zone NOT NULL DEFAULT now()
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX race_racer_pkey ON race_racer(race_racer_id int4_ops);
CREATE INDEX race_racer_log_id_idx ON race_racer(log_id int4_ops);
CREATE INDEX race_racer_racer_id_idx ON race_racer(racer_id int4_ops);
CREATE INDEX race_racer_race_id_idx ON race_racer(race_id int4_ops);
CREATE INDEX race_racer_created_at_idx ON race_racer(created_at timestamp_ops);
race_racer_id is an auto-increment and log_id is the record ID from our vendor. Duplicate log_id records can exist because the data is populated in batches.
Query
SELECT MAX(p2.race_racer_id) race_racer_id
FROM race_racer p2
WHERE p2.racer_id = 10002093
AND p2.created_at < '2024-08-01'
GROUP BY p2.log_id, p2.race_id
ORDER BY p2.log_id DESC
LIMIT 25
EXPLAIN (ANALYZE, BUFFERS, SETTINGS, VERBOSE)
Limit (cost=115.70..2445.14 rows=25 width=12) (actual time=150899.392..369576.667 rows=25 loops=1)
Output: (max(race_racer_id)), log_id, race_id
Buffers: shared hit=3984768 read=553269 written=66
-> GroupAggregate (cost=115.70..9417387.87 rows=101068 width=12) (actual time=150899.390..369576.638 rows=25 loops=1)
Output: max(race_racer_id), log_id, race_id
Group Key: p2.log_id, p2.race_id
Buffers: shared hit=3984768 read=553269 written=66
-> Incremental Sort (cost=115.70..9415611.55 rows=102085 width=12) (actual time=147171.318..369575.442 rows=7578 loops=1)
Output: log_id, race_id, race_racer_id
Sort Key: p2.log_id DESC, p2.race_id
Presorted Key: p2.log_id
Full-sort Groups: 26 Sort Method: quicksort Average Memory: 28kB Peak Memory: 28kB
Pre-sorted Groups: 26 Sort Method: quicksort Average Memory: 41kB Peak Memory: 41kB
Buffers: shared hit=3984768 read=553269 written=66
-> Index Scan Backward using race_racer_log_id_idx on public.race_racer p2 (cost=0.44..9411732.46 rows=102085 width=12) (actual time=142302.220..369572.386 rows=7865 loops=1)
Output: log_id, race_id, race_racer_id
Filter: ((p2.created_at < '2024-08-01 00:00:00'::timestamp without time zone) AND (p2.racer_id = 10002093))
Rows Removed by Filter: 4547156
Buffers: shared hit=3984765 read=553269 written=66
Settings: effective_cache_size = '19256MB', effective_io_concurrency = '300', maintenance_io_concurrency = '300', random_page_cost = '1', work_mem = '25671kB'
Query Identifier: 7029425313897609646
Planning:
Buffers: shared hit=182 read=10
Planning Time: 11.919 ms
Execution Time: 369579.279 ms
The query executes in about 45 seconds (and over six minutes in the plan above), which is incredibly slow. Please help! If I select the latest race data for a specific racer using race_id, it resolves in under 20 ms. I've tried adding indexes and restructuring the query, to no avail.
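For reference, the fast per-race lookup mentioned above is not shown in the question; a hedged sketch, assuming the same MAX-per-log pattern restricted to a single race:

```sql
-- Hypothetical per-race variant of the query. The equality filter on
-- race_id lets the existing race_racer_race_id_idx narrow the scan to
-- one race's rows, which is why it resolves quickly.
SELECT MAX(race_racer_id) AS race_racer_id
FROM race_racer
WHERE racer_id = 10002093
  AND race_id = 555   -- hypothetical race_id
GROUP BY log_id
ORDER BY log_id DESC
LIMIT 25;
```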
2 Answers
This query could benefit from a multicolumn index covering both the filter columns and the sort column. You might have to change/optimize the order of the columns in the index.
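A plausible candidate, assuming the goal is to serve the racer_id equality and created_at range filter from the index (the index name and column order here are illustrative, not from the original answer):

```sql
-- Hypothetical index: equality column first, then the range filter,
-- then log_id so the planner can read group keys straight off the index.
CREATE INDEX race_racer_racer_created_log_idx
    ON race_racer (racer_id, created_at, log_id);
```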
Try an index on (racer_id, log_id). That way the planner doesn't have to choose between the selectivity of racer_id and the ordering of log_id; it gets both at the same time.
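A minimal sketch of that suggestion (the index name is illustrative; a plain ascending btree works because Postgres can scan it backward to satisfy ORDER BY log_id DESC):

```sql
-- Hypothetical two-column index: racer_id narrows the scan to one racer,
-- and log_id provides the ordering, so the backward index scan can stop
-- early once the LIMIT is satisfied instead of filtering millions of rows.
CREATE INDEX race_racer_racer_id_log_id_idx
    ON race_racer (racer_id, log_id);
```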