skip to Main Content

I am really struggling with producing some code create ‘level1 question’ & ‘level2 question’ columns.

The logic to populate the columns is:

  • ‘L1’ when level = 1 & type = group then question
  • ‘L2’ when level = 2 & type = group then question

But I can not find an efficient way of creating ‘level1 question’ & ‘level2 question’ like in the example.

id sort type question level answer L1 L2 level1 question level2 question
1 1 Group Location 1 NULL Location NULL Location NULL
1 2 Question UK 2 Yes NULL NULL Location NULL
1 3 Question EU 2 No NULL NULL Location NULL
1 4 Group Asia 2 Yes NULL ASIA Location ASIA
1 5 Question AUS 3 Yes NULL NULL Location ASIA
1 6 Question NZ 3 No NULL NULL Location ASIA
1 7 Group Season 1 NULL Season NULL Season NULL
1 8 Question Summer 2 Yes NULL NULL Season NULL
2 1 Group City 1 NULL City NULL City NULL
2 2 Question London 2 Yes NULL NULL City NULL
2 3 Group Street 2 NULL NULL Street City Street
2 4 Question King ST 3 No NULL NULL City Street

I have tried to use the following code but it is so slow. Just taken ~2 hours over 5 million rows.

'level1 question' = 
                (select top 1 L1 
                from table2 t2
                where L1 is not null
                and t1.Sort >= t2.Sort and t1.ID = t2.ID
                order by id , sort DESC)
, 'level2 question' = 
            (select top 1 L2 
            from table2 t2
            where L2 is not null AND (Level >2 OR Type = 'Group')
            and t1.Sort >= t2.Sort and t1.ID = t2.ID
            order by id , sort DESC)

2

Answers


  1. I think you’re experiencing an overflow in memory due to the self join. Create a CTE or 2 that removes the rows with null and then join the CTE(s) it to the table instead. I bet if you looked up your execution plan, you have nested loops and sorts that are killing hour query. My solution should reduce the impact of those.

    Login or Signup to reply.
  2. The syntax of your code suggests that you are running SQL Server.

    In the 2022 version, we can use last_value() with ignore nulls to efficiently retrieve the latest non-null of the conditional computation:

    select t.*,
        last_value(case when level = 1 and type = 'Group' then question end) 
            ignore nulls
            over(partition by id order by sort) as level1_question,
        last_value(case when level = 2 and type = 'Group' then question end) 
            ignore nulls
            over(partition by id order by sort) as level2_question
    from mytable t
    

    In earlier versions, that do not support ignore nulls, we can still use window functions (which should be more efficient that multiple subqueries on a large dataset), but that requires more work. The idea is to define groups of rows that contain one non-null value followed by 0 to N null values:

    select t.*,
        max(question1) over(partition by id, grp1 order by sort) as level1_question,
        max(question2) over(partition by id, grp2 order by sort) as level2_question
    from (
        select t.*,
            count(question1) over(partition by id order by sort) grp1,
            count(question2) over(partition by id order by sort) grp2
        from mytable t
        cross apply (
            values (
                case when level = 1 and type = 'Group' then question end,
                case when level = 2 and type = 'Group' then question end
            ) 
        ) as v(question1, question2)
    ) t
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search