skip to Main Content

here is my dataset

processed_on|status |
------------+-------+
2023-01-01 |Success|
2023-01-02 |Success|
2023-01-03 |Success|
2023-01-04 |Fail   |
2023-01-05 |Fail   |
2023-01-06 |Success|
2023-01-07 |Fail   |
2023-01-08 |Success|
2023-01-09 |Success|

expected output is

--------------------------
start_date|end_date|status
--------------------------
2023-01-01|2023-01-03|Success
2023-01-04|2023-01-05|Fail
2023-01-06|2023-01-06|Success
2023-01-07|023-01-07|Fail
2023-01-08|2023-01-09|Success

i tried window function like led and lag .but no luck..

2

Answers


  1. You can use SQL window function to get the result you want in a few steps.

    Step 1. First use window function to compare previous status and current status, if they are different then mark it as status bucket starting point.

    Step 2. Then use window function to count how many status_bucket_starting_point we have reach so far, define it as status bucket number.

    Step 3. At last, group by stauts bucket number and use min()/max() function to get start date and end date of the bucket,

    Here is the PostgreSQL query:

    with data_with_bucket_start AS (
    select
        processed_on,
        status,
        case when lag(status) over (order by processed_on) = status then 0 else 1 end as status_bucket_start   
    from
        data_table
    ),
    data_with_bucket AS (
    select
        processed_on,
        status,
        sum(status_bucket_start) over (order by processed_on) as status_bucket
    from 
        data_with_bucket_start
    )
    select
        min(processed_on) as start_date,
        max(processed_on) as end_date,
        max(status) as stauts
    from
        data_with_bucket
    group by
        status_bucket
    order by
        status_bucket
    
    start_date end_date stauts
    2023-01-01 2023-01-03 Success
    2023-01-04 2023-01-05 Fail
    2023-01-06 2023-01-06 Success
    2023-01-07 2023-01-07 Fail
    2023-01-08 2023-01-09 Success
    Login or Signup to reply.
  2. This is a typical gaps-and-islands problem. Here, islands are groups of consecutive rows sharing the same status. We can identify them with the difference of row numbers, then aggregate:

    select min(processed_on) start_date, max(processed_on) end_date, status
    from (
        select t.*, 
            row_number() over(order by processed_on) rn1,
            row_number() over(partition by status order by processed_on) rn2
        from mytable t
    ) t
    group by status, rn1 - rn2
    order by start_date
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search