In a Postgres db, I have a view called school_id_cards
. The view captures a list of id cards per school and student in a school system. ID cards are periodically reissued to students, so there may be any number of cards per student. Each card_id
is unique. A student may belong to multiple schools. A sampling of records looks like the following:
card_id | school_id | student_id |
---|---|---|
1 | 1 | 1 |
2 | 1 | 1 |
3 | 1 | 2 |
4 | 1 | 2 |
5 | 1 | 3 |
6 | 1 | 3 |
7 | 1 | 4 |
8 | 1 | 4 |
9 | 2 | 5 |
10 | 2 | 5 |
11 | 2 | 6 |
12 | 3 | 7 |
Given a list of school_ids
, I want to retrieve the list of the most recently created id cards for each school+student, limited to a select number of students per school_id
.
I have the following query to get me what I need without any limiting:
SELECT card_id FROM school_id_cards
WHERE card_id IN (
SELECT MAX(card_id) FROM school_id_cards
WHERE
school_id in (1, 2, 3)
GROUP BY
school_id,
student_id
);
… which, for the above sample returns (2,4,6,8,10,11,12)
However, in my sub-squery, I’d like to limit the number of school+student records returned for each school_id
listed in the where clause. For instance, a limit of 2 to get up to the 2 most recently added students for school 1, up to the 2 most recently added students for school 2, and up to the 2 most recently added students for school 3. In this case, an end result of (6,8,10,11,12)
.
Is there a query that will accomplish this?
2
Answers
You can do this by building up a query of subqueries and
dense_rank
.First, get the latest cards for each student/school. (Note: IDs are poor surrogates for time ordering. Add a datetime column.)
Next, we use that as a subquery to get the order of the most recent cards issued to students by school. The most recent cards have
student_card_order = 1
.And, finally, we can fetch only the first two per school.
school_card_order <= 2;
Demonstration.
There may be a more compact or performant way to do it, but window functions and subqueries are a way to break down complex queries.
If your table is big you want to avoid expensive whole-table sequential scans. Use a smart query to pick qualifying rows with index(-only) scans from a matching index. Hugely faster.
Typically, there should exist some kind of "school" table in your DB with exactly one row per relevant school. Makes the query simpler and faster:
fiddle
This scales well for a small limit per school like you demonstrated. For a large limit, I would switch to a different query.
About the use of a recursive CTE (rCTE):
Be sure to have a matching index like
A simpler index with (default) ascending sort order is barely any worse. Postgres can scan B-tree indices backwards. Only opposing sort order would be less ideal.
If there is no
school
table:About
DISTINCT ON
:You could replace the non-recursive term with another, nested rCTE to generate the list of schools (possibly with the latest card to kick things off) …
But there really should be a
school
table. Create it if you don’t have it.