I have a PostgreSQL table with a bunch of events, each containing a timestamp, id, and action. I expect a few million events per day, and expect the table to grow to several billion rows eventually.
I want to query the most recent event for each item (there are a few hundred thousand distinct items) before a given date, but the query is extremely slow, taking 1-2 hours minimum (the table currently has around 100 million rows). Is there a way to speed it up?
SELECT a.* FROM (
    WITH events AS (
        SELECT
            ROW_NUMBER() OVER (PARTITION BY item ORDER BY time_stamp DESC) AS rn,
            *
        FROM event_updates
        WHERE time_stamp < '2023-05-01'
    )
    SELECT * FROM events WHERE rn = 1 ORDER BY item
) a;
DDL for the table:
CREATE TABLE "event_updates" (
    "id" int4 NOT NULL DEFAULT nextval('event_updates_id_seq'::regclass),
    "time_stamp" timestamptz(6) NOT NULL DEFAULT CURRENT_TIMESTAMP,
    "item" varchar(32) COLLATE "pg_catalog"."default" NOT NULL,
    "event_type" int2
)
PARTITION BY RANGE ("time_stamp");
ALTER TABLE "event_updates" OWNER TO "owners";
2 Answers
First of all, you need an index on item and the timestamp. Second, select the max(time_stamp) for each item and then join back to fetch the row you need:
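A sketch of that approach, using the table and column names from the question's DDL (note: if two rows for the same item share the same time_stamp, this returns both):

```sql
-- Supporting index: item first, newest timestamp first.
CREATE INDEX ON event_updates (item, time_stamp DESC);

-- Latest timestamp per item, then join back for the full row:
SELECT e.*
FROM (
    SELECT item, max(time_stamp) AS time_stamp
    FROM   event_updates
    WHERE  time_stamp < '2023-05-01'
    GROUP  BY item
) latest
JOIN event_updates e USING (item, time_stamp);
```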
CREATE INDEX CONCURRENTLY does not work on a partitioned table directly, so you might want to create the index per partition, concurrently, to avoid locking issues, then attach the partition indexes to a parent index. It takes longer, however.
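On PostgreSQL 11 or later, that procedure could be sketched as follows; the partition name below is hypothetical:

```sql
-- CREATE INDEX CONCURRENTLY is not supported on the partitioned parent,
-- so create an (initially invalid) index on the parent only ...
CREATE INDEX event_updates_item_ts_idx
    ON ONLY event_updates (item, time_stamp DESC);

-- ... then build each partition's index without blocking writes ...
CREATE INDEX CONCURRENTLY event_updates_p2023_item_ts_idx
    ON event_updates_p2023 (item, time_stamp DESC);

-- ... and attach it. The parent index becomes valid once every
-- partition's index has been attached.
ALTER INDEX event_updates_item_ts_idx
    ATTACH PARTITION event_updates_p2023_item_ts_idx;
```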
Emulate an index skip scan for best performance:
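A sketch of the recursive-CTE emulation, assuming the table definition from the question; each iteration jumps to the newest row of the next item in index order, touching only one index tuple per item:

```sql
WITH RECURSIVE latest AS (
   (  -- newest qualifying row of the first item in index order
   SELECT e.*
   FROM   event_updates e
   WHERE  e.time_stamp < '2023-05-01'
   ORDER  BY e.item, e.time_stamp DESC
   LIMIT  1
   )
   UNION ALL
   SELECT n.*
   FROM   latest l
   CROSS  JOIN LATERAL (
      -- newest qualifying row of the next item
      SELECT e.*
      FROM   event_updates e
      WHERE  e.item > l.item
      AND    e.time_stamp < '2023-05-01'
      ORDER  BY e.item, e.time_stamp DESC
      LIMIT  1
      ) n
   )
SELECT * FROM latest;
```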
In combination with a matching index (essential):
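For the skip-scan emulation, the matching index would look like this (descending timestamp, so each item's newest row comes first):

```sql
CREATE INDEX ON event_updates (item, time_stamp DESC);
```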
More optimization may be possible.
I’ll explain some more after you have disclosed the requested information.