Postgresql - Using nested lag or any other window function

trathi01
July 17, 2024
81 views
2 votes
3 Answers

I am working on a transaction dataset with one date column & one identifier column. Now retroactively I want to create a boolean column to check whether any particular activity (e.g. marketing campaign) should have been sent for that identifier if the last time a campaign sent was before 30 days. I can use lag function but the problem is that the starting point would keep on changing.

For example, on first transaction date a campaign would be sent, then for any transaction within next 30 days no campaign; then, let’s say next transaction happens on 35th day from first transaction; then, send a campaign and now the new counter for next 30 days should start from 35th day.

I’m haven’t been able to figure out how I can achieve it. I am using redshift sql (table for reference below)

Identifier	Date (YYYY-MM-DD time)	desired column
A	2024-06-30 11:22:03	FALSE
A	2024-06-16 14:02:36	FALSE
A	2024-06-15 8:15:57	TRUE
A	2024-05-24 14:30:57	FALSE
A	2024-05-10 4:45:06	FALSE
A	2024-05-08 15:19:48	FALSE
A	2024-05-07 15:09:14	FALSE
A	2024-05-06 4:16:39	FALSE
A	2024-05-04 10:37:16	TRUE
A	2024-04-08 5:02:00	FALSE
A	2024-04-06 9:03:23	FALSE
A	2024-03-30 11:05:55	TRUE
A	2024-03-16 8:39:56	FALSE
A	2024-03-15 14:06:10	FALSE
A	2024-02-28 16:55:28	TRUE

Thanks for the help!

Answers

Chosen as BEST ANSWER

I tried kind of a hacky solution for now which worked -

with min_ts as
(
    select
        identifier,
        min(ts) as min_ts
    from my_table
    group by 1
),
step_2 as
(
    select
        a.*,
        floor(datediff(second, b.min_ts, a.ts)/2592000::DOUBLE PRECISION) as floor_days_between_ts
    from my_table a
    inner join min_ts b
    on a.identifier = b.identifier
),
step_3 as
(
    select
        identifier,
        floor_days_between_ts,
        min(ts) as min_ts
    from step_2
    group by 1,2
)
select
    a.*,
    case when a.ts = b.min_ts then True::bool
         else False::bool end as campaign
from step_2 a
left join step_3 b
on a.identifier = b.identifier and a.floor_days_between_ts = b.floor_days_between_ts
order by a.identifier, a.ts desc
;

(Edit)

- ThorstenKettner
- July 16, 2024 at 7:10 pm
- 0 votes
0
You are looking for an iteration. The first date is 2024-02-28. With this date you are looking for the first date that is after its 30 days range. That date is 2024-03-30. Then again you want to find the first date after its 30 day range, and so on.

Iteration is done with recursive queries in SQL.
```
with recursive
  starters (identifier, ts) as
  (
    select identifier, min(ts)
    from mytable
    group by identifier
     union all
    select t.identifier, min(t.ts)
    from starters s
    join mytable t on t.identifier = s.identifier
                  and t.ts > s.ts + interval '30' day
    group by t.identifier
  )
select 
  t.*,
  exists 
  (
    select null
    from starters s
    where s.identifier = t.identifier
    and s.ts = t.ts
  ) as flag
from mytable t 
order by t.identifier, t.ts;
```
If redshift is not yet capable of handling aggregation in recursive CTEs as you say below in the comments, then use another way to get the top 1 row per group. Getting the top 1 means there does not exist a better one, so you may be able to solve this with NOT EXISTS.

The part
```
select identifier, min(ts)
from mytable
group by identifier
```
can also be written as
```
select identifier, ts
from mytable t
where not exists
(
  select null
  from mytable better
  where better.identifier = t.identifier
  and better.ts < t.ts
)
```
for instance.
Login or Signup to reply.

One option to do it is to find refference dates. You could try to do it combining dates and differences in days between them with a few ctes…

--    S a m p l e    D a t a :
Create Table tbl (identifier Varchar(6), dt TimeStamp);
Insert Into tbl VALUES 
('A',   '2024-06-30 11:22:03'),
('A',   '2024-06-16 14:02:36'),
('A',   '2024-06-15 08:15:57'),
('A',   '2024-05-24 14:30:57'), 
('A',   '2024-05-10 04:45:06'),
('A',   '2024-05-08 15:19:48'), 
('A',   '2024-05-07 15:09:14'), 
('A',   '2024-05-06 04:16:39'), 
('A',   '2024-05-04 10:37:16'),
('A',   '2024-04-08 05:02:00'),
('A',   '2024-04-06 09:03:23'), 
('A',   '2024-03-30 11:05:55'), 
('A',   '2024-03-16 08:39:56'), 
('A',   '2024-03-15 14:06:10'),
('A',   '2024-02-28 16:55:28');

… removed the time part and combined the dates (practicaly cross join) to get all day differences between them …

WITH
  dates  as
    ( Select    Row_Number() Over(Partition By d.identifier, d.dt_0 Order By d.dt_1) as rn_0, 
                d.identifier, d.dt_0, d.dt_1, d.diff
      From      ( Select t1.identifier, Date_Trunc('day', t1.dt) as dt_1, 
                         Case When DATE_PART('day', Date_Trunc('day', t1.dt) - Date_Trunc('day', t2.dt)) >= 30 
                              Then Date_Trunc('day', t2.dt)
                         End as dt_0, 
                         DATE_PART('day', Date_Trunc('day', t1.dt) - Date_Trunc('day', t2.dt)) as diff,
                         Min(Date_Trunc('day', t1.dt)) Over(Partition By t1.identifier) as start_dt
                   From   tbl t1
                   Inner Join tbl t2 ON(t1.identifier = t2.identifier) 
                ) d
      Where     diff >= 30 
   ),

… filtering the resultset …

  grid as 
    ( Select  Distinct d0.*, d1.dt_0 as dt_00, d1.dt_1 as dt_11
      From    dates d0
      Inner Join dates d1 on(d1.identifier = d0.identifier And 
                             d1.dt_0 = d0.dt_1 And 
                            d0.rn_0 = 1 And d1.rn_0 = 1
                           )
      Where d0.rn_0 = 1 And d0.dt_1 = d1.dt_0
   ),

… getting refference dates using UNION (not UNION ALL) to get ridd of duplicates …

  reff_dates as
   ( Select identifier, Min(dt_0) as dt_0 From grid Group By identifier UNION
     Select identifier, Min(dt_1)  From grid Group By identifier UNION
     Select identifier, Min(dt_11) From grid Group By identifier UNION
     Select Distinct identifier, dt_11 From grid
  )

… reference dates are left joined to your data defining the values ‘TRUE’/’FALSE’ for the new column using Case expressions …

--      M a i n    S Q L :
Select  t.identifier, t.dt,
        Case When r.dt_0 Is Null Then 'FALSE'
        Else 'TRUE'
        End as new_column
From    tbl t
Left Join reff_dates r ON( r.dt_0 = Date_Trunc('day', t.dt) )
Order By t.dt Desc

/*      R e s u l t :
identifier  dt                      new_column
---------   ----------------------  ----------
A           2024-06-30 11:22:03     FALSE
A           2024-06-16 14:02:36     FALSE
A           2024-06-15 08:15:57     TRUE
A           2024-05-24 14:30:57     FALSE
A           2024-05-10 04:45:06     FALSE
A           2024-05-08 15:19:48     FALSE
A           2024-05-07 15:09:14     FALSE
A           2024-05-06 04:16:39     FALSE
A           2024-05-04 10:37:16     TRUE
A           2024-04-08 05:02:00     FALSE
A           2024-04-06 09:03:23     FALSE
A           2024-03-30 11:05:55     TRUE
A           2024-03-16 08:39:56     FALSE
A           2024-03-15 14:06:10     FALSE
A           2024-02-28 16:55:28     TRUE         */

See the fiddle here.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Using nested lag or any other window function

Answers