Assume I have two tables a
and b
, each containing a series of dates:
Table a
:
day_date |
---|
2020-1-1 |
2020-1-2 |
2020-1-3 |
2020-1-4 |
2020-1-5 |
2020-1-6 |
2020-1-7 |
2020-1-8 |
2020-1-9 |
2020-1-10 |
and table b
:
id | some_date |
---|---|
0 | 2020-1-3 |
0 | 2020-1-6 |
0 | 2020-1-8 |
1 | 2020-1-2 |
1 | 2020-1-5 |
I want to create a new table c
which contains both day_date
, id
and some_date
, but now some_date
contains only values such that the minimum value above some given threshold is used, i.e.:
day_date | id | next_date |
---|---|---|
2020-1-1 | 0 | 2020-1-3 |
2020-1-2 | 0 | 2020-1-3 |
2020-1-3 | 0 | 2020-1-6 |
2020-1-4 | 0 | 2020-1-6 |
2020-1-5 | 0 | 2020-1-6 |
2020-1-6 | 0 | 2020-1-8 |
2020-1-7 | 0 | 2020-1-8 |
2020-1-8 | 0 | null |
2020-1-9 | 0 | null |
2020-1-10 | 0 | null |
2020-1-1 | 1 | 2020-1-2 |
2020-1-2 | 1 | 2020-1-5 |
2020-1-3 | 1 | 2020-1-5 |
2020-1-4 | 1 | 2020-1-5 |
2020-1-5 | 1 | null |
2020-1-6 | 1 | null |
2020-1-7 | 1 | null |
2020-1-8 | 1 | null |
2020-1-9 | 1 | null |
2020-1-10 | 1 | null |
My idea was to filter a cross join of the two, e.g.:
CREATE TEMP TABLE some_next AS (
SELECT
day_date,
id,
CASE WHEN some_date > day_date THEN some_date ELSE NULL END AS next_churn_date,
ROW_NUMBER() OVER (PARTITION BY day_date, id ORDER BY some_date ASC) AS rn_next
FROM a CROSS JOIN b
WHERE some_date > day_date OR day_date >= (SELECT MAX(some_date) FROM b bb WHERE bb.id = b.id
);
CREATE TABLE c AS (
SELECT * FROM some_next WHERE rn_next = 1 ORDER BY id, day_date
);
I was looking for a simpler solution. Any ideas?
2
Answers
We can start with a cross join to get all combinations of IDs and days.
Then use that to rank the some_days proximity to the days (dropping the order by, it’s unnecessary).
And select only those rows with a row_number of 1.
Demonstration.
A simpler solution – tested on Postgres, not Redshift (which is not Postgres at all):
fiddle
With an index on
b(id, some_date)
it can perform decently, as the correlated subquery results in one very fast index-only scan for each result row.The is a (much) faster way to get distinct
b.id
if there are only few distinct values in a big table. See: