Postgresql - Select N latest rows per school, but skip duplicates from the same student

askingquestions
October 15, 2023
108 views
0 votes
2 Answers

In a Postgres db, I have a view called school_id_cards. The view captures a list of id cards per school and student in a school system. ID cards are periodically reissued to students, so there may be any number of cards per student. Each card_id is unique. A student may belong to multiple schools. A sampling of records looks like the following:

card_id	school_id	student_id
1	1	1
2	1	1
3	1	2
4	1	2
5	1	3
6	1	3
7	1	4
8	1	4
9	2	5
10	2	5
11	2	6
12	3	7

Given a list of school_ids, I want to retrieve the list of the most recently created id cards for each school+student, limited to a select number of students per school_id.

I have the following query to get me what I need without any limiting:

SELECT card_id FROM school_id_cards
WHERE card_id IN (
  SELECT MAX(card_id) FROM school_id_cards
  WHERE
    school_id in (1, 2, 3)
  GROUP BY
    school_id,
    student_id
);

… which, for the above sample returns (2,4,6,8,10,11,12)

However, in my sub-squery, I’d like to limit the number of school+student records returned for each school_id listed in the where clause. For instance, a limit of 2 to get up to the 2 most recently added students for school 1, up to the 2 most recently added students for school 2, and up to the 2 most recently added students for school 3. In this case, an end result of (6,8,10,11,12).

Is there a query that will accomplish this?

Answers

You can do this by building up a query of subqueries and dense_rank.

First, get the latest cards for each student/school. (Note: IDs are poor surrogates for time ordering. Add a datetime column.)

select 
  *,
  dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
from school_id_cards

Next, we use that as a subquery to get the order of the most recent cards issued to students by school. The most recent cards have student_card_order = 1.

with ordered_student_cards as (
  select 
    *,
    dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
  from school_id_cards
)
select
  *,
  dense_rank() over(partition by school_id order by card_id desc) as school_card_order
from ordered_student_cards
where student_card_order = 1

And, finally, we can fetch only the first two per school. school_card_order <= 2;

with ordered_student_cards as (
  select 
    *,
    dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
  from school_id_cards
), ordered_school_cards as (
  select
    *,
    dense_rank() over(partition by school_id order by card_id desc) as school_card_order
  from ordered_student_cards
  where student_card_order = 1
)
select card_id
from ordered_school_cards
where school_card_order <= 2;

Demonstration.

There may be a more compact or performant way to do it, but window functions and subqueries are a way to break down complex queries.

If your table is big you want to avoid expensive whole-table sequential scans. Use a smart query to pick qualifying rows with index(-only) scans from a matching index. Hugely faster.

Typically, there should exist some kind of "school" table in your DB with exactly one row per relevant school. Makes the query simpler and faster:

WITH RECURSIVE latest_card AS (
   SELECT c.*
   FROM   school s
   CROSS  JOIN LATERAL (
      SELECT c.school_id, c.card_id, ARRAY[c.student_id] AS leading_ids
      FROM   school_id_cards c
      WHERE  c.school_id = s.school_id
      ORDER  BY c.card_id DESC
      LIMIT  1
      ) c

   UNION ALL
   SELECT c.*
   FROM   latest_card l
   JOIN   LATERAL (
      SELECT l.school_id, c.card_id, l.leading_ids || student_id
      FROM   school_id_cards c
      WHERE  c.school_id = l.school_id
      AND    c.card_id < l.card_id
      AND    c.student_id <> ALL (l.leading_ids)
      ORDER  BY c.card_id DESC
      LIMIT  1
      ) C ON cardinality(l.leading_ids) < 2  -- your limit per school here!
   )
SELECT card_id
FROM   latest_card
ORDER  BY card_id;

fiddle

This scales well for a small limit per school like you demonstrated. For a large limit, I would switch to a different query.

About the use of a recursive CTE (rCTE):

Optimize GROUP BY query to retrieve latest row per user

Be sure to have a matching index like

CREATE INDEX ON school_id_cards (school_id DESC, card_id DESC);

A simpler index with (default) ascending sort order is barely any worse. Postgres can scan B-tree indices backwards. Only opposing sort order would be less ideal.

If there is no school table:

WITH RECURSIVE latest_card AS (
   (
   SELECT DISTINCT ON (school_id)
          school_id, card_id, ARRAY[student_id] AS leading_ids
   FROM   school_id_cards c
   ORDER  BY school_id DESC, card_id DESC
   )

   UNION ALL
   SELECT c.*
   FROM   latest_card l
   JOIN   LATERAL (
      SELECT l.school_id, c.card_id, l.leading_ids || student_id
      FROM   school_id_cards c
      WHERE  c.school_id = l.school_id
      AND    c.card_id < l.card_id
      AND    c.student_id <> ALL (l.leading_ids)
      ORDER  BY c.card_id DESC
      LIMIT  1
      ) C ON cardinality(l.leading_ids) < 2  -- your limit per school here!
   )
SELECT card_id
FROM   latest_card
ORDER  BY card_id;

About DISTINCT ON:

Select first row in each GROUP BY group?

You could replace the non-recursive term with another, nested rCTE to generate the list of schools (possibly with the latest card to kick things off) …
But there really should be a school table. Create it if you don’t have it.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Select N latest rows per school, but skip duplicates from the same student

Answers