I’m working on a SQL solution to reduce the number of rows by merging records whose "close" value matches the "start" of the next record for the same name. Here is an example input and the required output:
Input
name | start | close |
---|---|---|
A | 120 | 130 |
A | 130 | 140 |
A | 140 | 150 |
A | 152 | 160 |
A | 160 | 180 |
B | 100 | 130 |
B | 130 | 200 |
B | 202 | 250 |
C | 300 | 400 |
Required output
name | start | close |
---|---|---|
A | 120 | 150 |
A | 152 | 180 |
B | 100 | 200 |
B | 202 | 250 |
C | 300 | 400 |
I tried with the lag() function, but I did not get the correct output: the continuation rows are dropped, while the surviving rows keep their original close values, so the merged boundaries are wrong.
with t1 as (
    select name, start, close,
           lag(close) over (partition by name order by start) as pclose
    from event
),
t2 as (
    select * from t1
    where 1 = (case when pclose is null then 1
                    when start = pclose then 0
                    else 1 end)
)
select * from t2 order by name, start;
Any standard-compliant solution is welcome, as I can port it easily to Spark.
2 Answers
Again, a classic gaps-and-islands problem. Here's how you can achieve the result: flag every row that starts a new island with lag(), turn a running sum of that flag into an island id, then take min(start) and max(close) per island. Here's the SQL version:
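A sketch against the event table from your attempt; it only needs standard window functions and aggregation, so it should port to Spark SQL as-is:

```sql
with flagged as (
    -- 1 marks a row that starts a new island: its start does not
    -- equal the previous row's close for the same name
    select name, start, close,
           case when start = lag(close) over (partition by name order by start)
                then 0 else 1
           end as new_island
    from event
),
numbered as (
    -- a running sum of the flag gives every island its own id
    select name, start, close,
           sum(new_island) over (partition by name order by start) as island_id
    from flagged
)
select name, min(start) as start, max(close) as close
from numbered
group by name, island_id
order by name, start;
```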
This assumes the events are strictly sequential, i.e. a row is merged only when its start equals the close of the previous record; any gap starts a new output row.
try this:
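An alternative sketch without window functions, again assuming the event table from the question: keep the rows that start an island (their start is not the close of any other row) and pair each one with the nearest island end.

```sql
select s.name,
       s.start,
       min(e.close) as close                      -- nearest island end
from event s
join event e
  on  e.name  = s.name
 and  e.close > s.start
where not exists (select 1 from event p           -- s starts an island:
                  where p.name  = s.name          -- no row closes at s.start
                    and p.close = s.start)
  and not exists (select 1 from event n           -- e ends an island:
                  where n.name  = e.name          -- no row starts at e.close
                    and n.start = e.close)
group by s.name, s.start
order by s.name, s.start;
```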
Note: this only works when two events do not overlap.
Result:
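name | start | close |
---|---|---|
A | 120 | 150 |
A | 152 | 180 |
B | 100 | 200 |
B | 202 | 250 |
C | 300 | 400 |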