skip to Main Content

I am working on a transaction dataset with one date column & one identifier column. Now retroactively I want to create a boolean column to check whether any particular activity (e.g. marketing campaign) should have been sent for that identifier if the last time a campaign sent was before 30 days. I can use lag function but the problem is that the starting point would keep on changing.

For example, on first transaction date a campaign would be sent, then for any transaction within next 30 days no campaign; then, let’s say next transaction happens on 35th day from first transaction; then, send a campaign and now the new counter for next 30 days should start from 35th day.

I’m haven’t been able to figure out how I can achieve it. I am using redshift sql (table for reference below)

Identifier Date (YYYY-MM-DD time) desired column
A 2024-06-30 11:22:03 FALSE
A 2024-06-16 14:02:36 FALSE
A 2024-06-15 8:15:57 TRUE
A 2024-05-24 14:30:57 FALSE
A 2024-05-10 4:45:06 FALSE
A 2024-05-08 15:19:48 FALSE
A 2024-05-07 15:09:14 FALSE
A 2024-05-06 4:16:39 FALSE
A 2024-05-04 10:37:16 TRUE
A 2024-04-08 5:02:00 FALSE
A 2024-04-06 9:03:23 FALSE
A 2024-03-30 11:05:55 TRUE
A 2024-03-16 8:39:56 FALSE
A 2024-03-15 14:06:10 FALSE
A 2024-02-28 16:55:28 TRUE

Thanks for the help!

3

Answers


  1. Chosen as BEST ANSWER

    I tried kind of a hacky solution for now which worked -

    with min_ts as
    (
        select
            identifier,
            min(ts) as min_ts
        from my_table
        group by 1
    ),
    step_2 as
    (
        select
            a.*,
            floor(datediff(second, b.min_ts, a.ts)/2592000::DOUBLE PRECISION) as floor_days_between_ts
        from my_table a
        inner join min_ts b
        on a.identifier = b.identifier
    ),
    step_3 as
    (
        select
            identifier,
            floor_days_between_ts,
            min(ts) as min_ts
        from step_2
        group by 1,2
    )
    select
        a.*,
        case when a.ts = b.min_ts then True::bool
             else False::bool end as campaign
    from step_2 a
    left join step_3 b
    on a.identifier = b.identifier and a.floor_days_between_ts = b.floor_days_between_ts
    order by a.identifier, a.ts desc
    ;
    

  2. You are looking for an iteration. The first date is 2024-02-28. With this date you are looking for the first date that is after its 30 days range. That date is 2024-03-30. Then again you want to find the first date after its 30 day range, and so on.

    Iteration is done with recursive queries in SQL.

    with recursive
      starters (identifier, ts) as
      (
        select identifier, min(ts)
        from mytable
        group by identifier
         union all
        select t.identifier, min(t.ts)
        from starters s
        join mytable t on t.identifier = s.identifier
                      and t.ts > s.ts + interval '30' day
        group by t.identifier
      )
    select 
      t.*,
      exists 
      (
        select null
        from starters s
        where s.identifier = t.identifier
        and s.ts = t.ts
      ) as flag
    from mytable t 
    order by t.identifier, t.ts;
    

    If redshift is not yet capable of handling aggregation in recursive CTEs as you say below in the comments, then use another way to get the top 1 row per group. Getting the top 1 means there does not exist a better one, so you may be able to solve this with NOT EXISTS.

    The part

    select identifier, min(ts)
    from mytable
    group by identifier
    

    can also be written as

    select identifier, ts
    from mytable t
    where not exists
    (
      select null
      from mytable better
      where better.identifier = t.identifier
      and better.ts < t.ts
    )
    

    for instance.

    Login or Signup to reply.
  3. One option to do it is to find refference dates. You could try to do it combining dates and differences in days between them with a few ctes…

    --    S a m p l e    D a t a :
    Create Table tbl (identifier Varchar(6), dt TimeStamp);
    Insert Into tbl VALUES 
    ('A',   '2024-06-30 11:22:03'),
    ('A',   '2024-06-16 14:02:36'),
    ('A',   '2024-06-15 08:15:57'),
    ('A',   '2024-05-24 14:30:57'), 
    ('A',   '2024-05-10 04:45:06'),
    ('A',   '2024-05-08 15:19:48'), 
    ('A',   '2024-05-07 15:09:14'), 
    ('A',   '2024-05-06 04:16:39'), 
    ('A',   '2024-05-04 10:37:16'),
    ('A',   '2024-04-08 05:02:00'),
    ('A',   '2024-04-06 09:03:23'), 
    ('A',   '2024-03-30 11:05:55'), 
    ('A',   '2024-03-16 08:39:56'), 
    ('A',   '2024-03-15 14:06:10'),
    ('A',   '2024-02-28 16:55:28');
    

    … removed the time part and combined the dates (practicaly cross join) to get all day differences between them …

    WITH
      dates  as
        ( Select    Row_Number() Over(Partition By d.identifier, d.dt_0 Order By d.dt_1) as rn_0, 
                    d.identifier, d.dt_0, d.dt_1, d.diff
          From      ( Select t1.identifier, Date_Trunc('day', t1.dt) as dt_1, 
                             Case When DATE_PART('day', Date_Trunc('day', t1.dt) - Date_Trunc('day', t2.dt)) >= 30 
                                  Then Date_Trunc('day', t2.dt)
                             End as dt_0, 
                             DATE_PART('day', Date_Trunc('day', t1.dt) - Date_Trunc('day', t2.dt)) as diff,
                             Min(Date_Trunc('day', t1.dt)) Over(Partition By t1.identifier) as start_dt
                       From   tbl t1
                       Inner Join tbl t2 ON(t1.identifier = t2.identifier) 
                    ) d
          Where     diff >= 30 
       ), 
    

    … filtering the resultset …

      grid as 
        ( Select  Distinct d0.*, d1.dt_0 as dt_00, d1.dt_1 as dt_11
          From    dates d0
          Inner Join dates d1 on(d1.identifier = d0.identifier And 
                                 d1.dt_0 = d0.dt_1 And 
                                d0.rn_0 = 1 And d1.rn_0 = 1
                               )
          Where d0.rn_0 = 1 And d0.dt_1 = d1.dt_0
       ),
    

    … getting refference dates using UNION (not UNION ALL) to get ridd of duplicates …

      reff_dates as
       ( Select identifier, Min(dt_0) as dt_0 From grid Group By identifier UNION
         Select identifier, Min(dt_1)  From grid Group By identifier UNION
         Select identifier, Min(dt_11) From grid Group By identifier UNION
         Select Distinct identifier, dt_11 From grid
      )
    

    … reference dates are left joined to your data defining the values ‘TRUE’/’FALSE’ for the new column using Case expressions …

    --      M a i n    S Q L :
    Select  t.identifier, t.dt,
            Case When r.dt_0 Is Null Then 'FALSE'
            Else 'TRUE'
            End as new_column
    From    tbl t
    Left Join reff_dates r ON( r.dt_0 = Date_Trunc('day', t.dt) )
    Order By t.dt Desc
    
    /*      R e s u l t :
    identifier  dt                      new_column
    ---------   ----------------------  ----------
    A           2024-06-30 11:22:03     FALSE
    A           2024-06-16 14:02:36     FALSE
    A           2024-06-15 08:15:57     TRUE
    A           2024-05-24 14:30:57     FALSE
    A           2024-05-10 04:45:06     FALSE
    A           2024-05-08 15:19:48     FALSE
    A           2024-05-07 15:09:14     FALSE
    A           2024-05-06 04:16:39     FALSE
    A           2024-05-04 10:37:16     TRUE
    A           2024-04-08 05:02:00     FALSE
    A           2024-04-06 09:03:23     FALSE
    A           2024-03-30 11:05:55     TRUE
    A           2024-03-16 08:39:56     FALSE
    A           2024-03-15 14:06:10     FALSE
    A           2024-02-28 16:55:28     TRUE         */
    

    See the fiddle here.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search