skip to Main Content

In a Postgres db, I have a view called school_id_cards. The view captures a list of id cards per school and student in a school system. ID cards are periodically reissued to students, so there may be any number of cards per student. Each card_id is unique. A student may belong to multiple schools. A sampling of records looks like the following:

card_id school_id student_id
1 1 1
2 1 1
3 1 2
4 1 2
5 1 3
6 1 3
7 1 4
8 1 4
9 2 5
10 2 5
11 2 6
12 3 7

Given a list of school_ids, I want to retrieve the list of the most recently created id cards for each school+student, limited to a select number of students per school_id.

I have the following query to get me what I need without any limiting:

SELECT card_id FROM school_id_cards
WHERE card_id IN (
  SELECT MAX(card_id) FROM school_id_cards
  WHERE
    school_id in (1, 2, 3)
  GROUP BY
    school_id,
    student_id
);

… which, for the above sample returns (2,4,6,8,10,11,12)

However, in my sub-squery, I’d like to limit the number of school+student records returned for each school_id listed in the where clause. For instance, a limit of 2 to get up to the 2 most recently added students for school 1, up to the 2 most recently added students for school 2, and up to the 2 most recently added students for school 3. In this case, an end result of (6,8,10,11,12).

Is there a query that will accomplish this?

2

Answers


  1. You can do this by building up a query of subqueries and dense_rank.

    First, get the latest cards for each student/school. (Note: IDs are poor surrogates for time ordering. Add a datetime column.)

    select 
      *,
      dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
    from school_id_cards
    

    Next, we use that as a subquery to get the order of the most recent cards issued to students by school. The most recent cards have student_card_order = 1.

    with ordered_student_cards as (
      select 
        *,
        dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
      from school_id_cards
    )
    select
      *,
      dense_rank() over(partition by school_id order by card_id desc) as school_card_order
    from ordered_student_cards
    where student_card_order = 1
    

    And, finally, we can fetch only the first two per school. school_card_order <= 2;

    with ordered_student_cards as (
      select 
        *,
        dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
      from school_id_cards
    ), ordered_school_cards as (
      select
        *,
        dense_rank() over(partition by school_id order by card_id desc) as school_card_order
      from ordered_student_cards
      where student_card_order = 1
    )
    select card_id
    from ordered_school_cards
    where school_card_order <= 2;
    

    Demonstration.

    There may be a more compact or performant way to do it, but window functions and subqueries are a way to break down complex queries.

    Login or Signup to reply.
  2. If your table is big you want to avoid expensive whole-table sequential scans. Use a smart query to pick qualifying rows with index(-only) scans from a matching index. Hugely faster.

    Typically, there should exist some kind of "school" table in your DB with exactly one row per relevant school. Makes the query simpler and faster:

    WITH RECURSIVE latest_card AS (
       SELECT c.*
       FROM   school s
       CROSS  JOIN LATERAL (
          SELECT c.school_id, c.card_id, ARRAY[c.student_id] AS leading_ids
          FROM   school_id_cards c
          WHERE  c.school_id = s.school_id
          ORDER  BY c.card_id DESC
          LIMIT  1
          ) c
    
       UNION ALL
       SELECT c.*
       FROM   latest_card l
       JOIN   LATERAL (
          SELECT l.school_id, c.card_id, l.leading_ids || student_id
          FROM   school_id_cards c
          WHERE  c.school_id = l.school_id
          AND    c.card_id < l.card_id
          AND    c.student_id <> ALL (l.leading_ids)
          ORDER  BY c.card_id DESC
          LIMIT  1
          ) C ON cardinality(l.leading_ids) < 2  -- your limit per school here!
       )
    SELECT card_id
    FROM   latest_card
    ORDER  BY card_id;
    

    fiddle

    This scales well for a small limit per school like you demonstrated. For a large limit, I would switch to a different query.

    About the use of a recursive CTE (rCTE):

    Be sure to have a matching index like

    CREATE INDEX ON school_id_cards (school_id DESC, card_id DESC);
    

    A simpler index with (default) ascending sort order is barely any worse. Postgres can scan B-tree indices backwards. Only opposing sort order would be less ideal.

    If there is no school table:

    WITH RECURSIVE latest_card AS (
       (
       SELECT DISTINCT ON (school_id)
              school_id, card_id, ARRAY[student_id] AS leading_ids
       FROM   school_id_cards c
       ORDER  BY school_id DESC, card_id DESC
       )
    
       UNION ALL
       SELECT c.*
       FROM   latest_card l
       JOIN   LATERAL (
          SELECT l.school_id, c.card_id, l.leading_ids || student_id
          FROM   school_id_cards c
          WHERE  c.school_id = l.school_id
          AND    c.card_id < l.card_id
          AND    c.student_id <> ALL (l.leading_ids)
          ORDER  BY c.card_id DESC
          LIMIT  1
          ) C ON cardinality(l.leading_ids) < 2  -- your limit per school here!
       )
    SELECT card_id
    FROM   latest_card
    ORDER  BY card_id;
    

    About DISTINCT ON:

    You could replace the non-recursive term with another, nested rCTE to generate the list of schools (possibly with the latest card to kick things off) …
    But there really should be a school table. Create it if you don’t have it.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search