I am doing a self join on a table with a range condition. Below is the SQL statement.
select
    current_rec.*,
    compare_rec.employee_id as less_salary_employee
from hr.employees current_rec
left join hr.employees compare_rec
    on current_rec.hire_date < compare_rec.hire_date
    and current_rec.salary < compare_rec.salary
order by current_rec.employee_id;
For 300K records, it takes 45-60 minutes. After googling, I learned that binned range join optimization could help, and I read this article on improving query performance:
[https://select.dev/posts/snowflake-range-join-optimization#how-to-optimize-range-joins-in-snowflake][1]. But I am not able to understand it. Can someone explain it, especially the binned range optimization part? A visual representation would help.
2 Answers
1. Create a CTE that bins the employees table into ranges based on hire date.
2. Join the CTE against itself, filtering on current_rec.bin <= compare_rec.bin so that only relevant bins are joined (since current_rec.hire_date must be earlier, its bin can be at most compare_rec's bin).
3. Add the additional filter on salary to complete the range join logic.

This allows the join to run in a more optimized way by avoiding comparisons between rows in bins that cannot possibly match.
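The steps above might look roughly like this. This is only a sketch: the one-year bin width (via `year(hire_date)`) is an assumption, and you would tune it to how your hire dates are distributed:

```sql
-- Sketch: bin each row by hire-date year, then join only bins that can match.
with binned as (
    select
        e.*,
        year(e.hire_date) as bin   -- assumed bin width: one calendar year
    from hr.employees e
)
select
    current_rec.*,
    compare_rec.employee_id as less_salary_employee
from binned current_rec
left join binned compare_rec
    on current_rec.bin <= compare_rec.bin              -- prune irrelevant bins
    and current_rec.hire_date < compare_rec.hire_date  -- original range condition
    and current_rec.salary < compare_rec.salary
order by current_rec.employee_id;
```

The bin predicate does not change the result set; it only gives the optimizer a cheap coarse filter to discard bin pairs before evaluating the exact date and salary comparisons.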
A table with 300K records that is self-joined results in roughly 90 billion comparisons (300,000 x 300,000). I am astonished that it took only an hour to perform that many.
First we compute an estimation by grouping the data, then we use this result to obtain the exact solution.
To optimize this task it might help to group the table into coarser categories. Do you really need to compare the exact hire date and the exact salary? I would suggest truncating the hire date to the month and rounding the salary to the nearest $1000. In my case this reduced the data to 12K records. That still means over 100 million comparisons, but this query runs in about a minute.
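A sketch of that coarsening step, assuming Snowflake's `date_trunc` and `round` with a negative scale; the table name `hr.employees_grouped` is made up for illustration:

```sql
-- Collapse 300K raw rows into coarse (month, salary-band) groups.
create or replace table hr.employees_grouped as
select
    date_trunc('month', hire_date) as hire_month,   -- truncate hiring day to month
    round(salary, -3)              as salary_band,  -- round salary to nearest $1000
    count(*)                       as emp_count
from hr.employees
group by 1, 2;

-- Approximate comparison over ~12K grouped rows instead of 300K raw rows.
select
    a.hire_month,
    a.salary_band,
    sum(b.emp_count) as approx_later_higher_paid
from hr.employees_grouped a
left join hr.employees_grouped b
    on a.hire_month < b.hire_month
    and a.salary_band < b.salary_band
group by a.hire_month, a.salary_band;
```

Because the join now runs over group representatives rather than individual employees, the quadratic cost shrinks with the square of the reduction factor.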
To obtain the exact calculation, remove the # in the first line and replace both occurrences of < with <=, which exports the data into a new table several GB in size. Then run this query; it runs for quite a while, but after about 20 minutes you obtain the exact answer.
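The answerer's actual query is not shown, so the following is only a guess at its shape: the coarse month/band comparisons are relaxed to <= (equal bins can still contain exact matches) and serve as a cheap prune before the exact conditions are applied to the raw rows:

```sql
-- Hypothetical exact step: coarse keys prune candidate pairs, exact predicates decide.
select
    a.employee_id,
    b.employee_id as less_salary_employee
from hr.employees a
join hr.employees b
    on date_trunc('month', a.hire_date) <= date_trunc('month', b.hire_date)
    and round(a.salary, -3) <= round(b.salary, -3)  -- the < replaced by <=
    and a.hire_date < b.hire_date                   -- exact original conditions
    and a.salary    < b.salary;
```

The <= on the coarse keys is essential: with a strict <, pairs of employees hired in the same month (or in the same $1000 salary band) would be silently dropped from the exact result.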