I have three tables: conversations
, user_messages
and system_messages
. Their basic structure (irrelevant columns removed for brevity) is:
create table conversations(id int generated by default as identity primary key);
create table user_messages(
id int generated by default as identity primary key
,conversation_id int references conversations(id)
,content text
,created_at timestamp);
create table system_messages(
id int generated by default as identity primary key
,conversation_id int references conversations(id)
,content text
,created_at timestamp);
I am trying to generate an export of "system said X" and "user responded with Y" rows for a given conversation. The issue I’m having is finding a way to ensure that the "user responded with Y" message corresponds to the chronologically next message in the conversation, following the system message. So output would look something like:
system_said | user_said |
---|---|
Hi! | Hello |
How’s it going? | I’m alright, you? |
I’m good! | Nice. |
-
This works, but with a poor and inefficient query, using a sub-query that references outer values.
SELECT content AS system_said, ( SELECT content FROM user_messages WHERE conversation_id = sm.conversation_id AND created_at > sm.created_at ORDER BY created_at LIMIT 1 ) AS user_said FROM system_messages sm WHERE sm.conversation_id = ?
-
This is where I’m stuck. I’m trying to rewrite the above more efficiently using a proper join, but I can’t force the join to give me the chronologically next message, so I’m getting mixed up pairs like:
system_said user_said Hi! Nice. How’s it going? Hello I’m good! I’m alright, you? Here’s the query:
SELECT DISTINCT ON (sm) sm.id AS message, sm.content AS system_said, um.content AS user_said FROM system_messages sm JOIN ( SELECT conversation_id, created_at, content FROM user_messages ORDER BY created_at ) um ON um.conversation_id = sm.conversation_id AND um.created_at > sm.created_at ORDER BY sm WHERE c.id = ?
(I found I had to add in
DISTINCT ON ()
otherwise I got duplicate rows.)
Is there a way to this?
3
Answers
Use a
CTE
to rank user messages within each conversation based on their creation time and then use aJOIN
to pair each system message with the next user message using the row number. Try this:I would try:
Schema (PostgreSQL v15)
Query #1
View on DB Fiddle
Your idea with
distinct on
sounds good, but there are a few things you could improve:Distinct on(sm)
checks the entire recordFROM system_messages sm
. Since it has a primary key,distinct on(sm.id)
is sufficient.order by
you use withdistinct on
could use more than just the unique column. If you want the soonest response, add its timestamp:order by sm.id,um.created_at
.In a demo at db<>fiddle with 20k conversations and 200k messages in them, this takes 40ms:
You can also add covering indexes to speed things up with index-only scans, which push all this way below 1ms exec time:
With those in place, it looks like your initial idea wins performance wise, whether you push down the correlated scalar subquery down to be cross joined with, as a
lateral
suqbuery doing the exact same thing, or not: demo2 at db<>fiddle