skip to Main Content

Assume I have two tables a and b, each containing a series of dates:

Table a:

day_date
2020-1-1
2020-1-2
2020-1-3
2020-1-4
2020-1-5
2020-1-6
2020-1-7
2020-1-8
2020-1-9
2020-1-10

and table b:

id some_date
0 2020-1-3
0 2020-1-6
0 2020-1-8
1 2020-1-2
1 2020-1-5

I want to create a new table c which contains both day_date, id and some_date, but now some_date contains only values such that the minimum value above some given threshold is used, i.e.:

day_date id next_date
2020-1-1 0 2020-1-3
2020-1-2 0 2020-1-3
2020-1-3 0 2020-1-6
2020-1-4 0 2020-1-6
2020-1-5 0 2020-1-6
2020-1-6 0 2020-1-8
2020-1-7 0 2020-1-8
2020-1-8 0 null
2020-1-9 0 null
2020-1-10 0 null
2020-1-1 1 2020-1-2
2020-1-2 1 2020-1-5
2020-1-3 1 2020-1-5
2020-1-4 1 2020-1-5
2020-1-5 1 null
2020-1-6 1 null
2020-1-7 1 null
2020-1-8 1 null
2020-1-9 1 null
2020-1-10 1 null

My idea was to filter a cross join of the two, e.g.:

CREATE TEMP TABLE some_next AS (
  SELECT
    day_date,
    id,
    CASE WHEN some_date > day_date THEN some_date ELSE NULL END AS next_churn_date,
    ROW_NUMBER() OVER (PARTITION BY day_date, id ORDER BY some_date ASC) AS rn_next
  FROM a CROSS JOIN b
  WHERE some_date > day_date OR day_date >= (SELECT MAX(some_date) FROM b bb WHERE bb.id = b.id
);

CREATE TABLE c AS (
  SELECT * FROM some_next WHERE rn_next = 1 ORDER BY id, day_date
);

I was looking for a simpler solution. Any ideas?

2

Answers


  1. We can start with a cross join to get all combinations of IDs and days.

    select
      day,
      id
    from days
    cross join ( select distinct id from some_days )
    order by id, day;
    

    Then use that to rank the some_days proximity to the days (dropping the order by, it’s unnecessary).

    with days_ids as (
      select
        day,
        id
      from days
      cross join ( select distinct id from some_days )
    )
    select
      di.day as day,
      di.id,
      sd.day as some_day,
      row_number() over (
        partition by di.day, di.id
        order by sd.day asc
      ) as row_num
    from days_ids di
    left join some_days sd on di.day < sd.day and di.id = sd.id
    order by di.id, di.day, row_num
    

    And select only those rows with a row_number of 1.

    with days_ids as (
      select
        day,
        id
      from days
      cross join ( select distinct id from some_days )
    ), 
    matched_days as (
      select
        di.day as day,
        di.id,
        sd.day as some_day,
        row_number() over (
          partition by di.day, di.id
          order by sd.day asc
        ) as row_num
      from days_ids di
      left join some_days sd on di.day < sd.day and di.id = sd.id
    )
    select day, id, some_day
    from matched_days
    where row_num = 1
    order by id, day
    

    Demonstration.

    Login or Signup to reply.
  2. A simpler solution – tested on Postgres, not Redshift (which is not Postgres at all):

    SELECT a.day_date, b1.id
        , (SELECT b.some_date
           FROM   b
           WHERE  b.id = b1.id
           AND    b.some_date > a.day_date
           ORDER  BY b.some_date
           LIMIT  1) AS next_date
    FROM   a
    CROSS  JOIN (SELECT DISTINCT id FROM b) b1
    ORDER  BY 2, 1;
    

    fiddle

    With an index on b(id, some_date) it can perform decently, as the correlated subquery results in one very fast index-only scan for each result row.
    The is a (much) faster way to get distinct b.id if there are only few distinct values in a big table. See:

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search