I am doing a self join on a table with a range condition. Below is the SQL statement.

select
  current_rec.*,
  compare_rec.employee_id as less_salary_employee
from hr.employees current_rec
left join hr.employees compare_rec
  on  current_rec.hire_date < compare_rec.hire_date
  and current_rec.salary < compare_rec.salary
order by current_rec.employee_id;

For 300K records, it takes 45~60 minutes. After some googling, I learned that the binned range join optimization can help, and I read this article on improving query performance:
https://select.dev/posts/snowflake-range-join-optimization#how-to-optimize-range-joins-in-snowflake . However, I am not able to understand it. Can someone explain it, especially the binned range join optimization part? A visual representation would help.

2 Answers


  1. Create a CTE that bins the employees table into ranges based on hire date.
    Join the CTE against itself, adding a filter on current_rec.bin <= compare_rec.bin so that only the relevant bins are compared.
    Keep the original hire_date and salary predicates so the results match the original query.
    This allows the join to be evaluated more efficiently because rows in bins that cannot possibly match are never compared.

    WITH bins AS (
      SELECT *,
        CASE 
          WHEN hire_date < '2010-01-01' THEN 1
          WHEN hire_date >= '2010-01-01' AND hire_date < '2015-01-01' THEN 2
          WHEN hire_date >= '2015-01-01' AND hire_date < '2020-01-01' THEN 3
          WHEN hire_date >= '2020-01-01' THEN 4
        END AS bin
      FROM hr.employees
    )
    
    SELECT 
      current_rec.*, 
      compare_rec.employee_id AS less_salary_employee
    FROM bins current_rec
    LEFT JOIN bins compare_rec
      ON  current_rec.bin <= compare_rec.bin
      AND current_rec.hire_date < compare_rec.hire_date
      AND current_rec.salary < compare_rec.salary
    ORDER BY current_rec.employee_id;
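
    As a minimal sketch of the same idea without a hand-written CASE, the bin can also be derived from the hire date itself (assuming Snowflake; the 5-year bin width and the 2005-01-01 anchor below are illustrative choices, not part of the original query):

    WITH bins AS (
      SELECT
        e.*,
        -- one bin per 60-month (5-year) slice, counted from an arbitrary anchor date
        FLOOR(DATEDIFF('month', DATE '2005-01-01', e.hire_date) / 60) AS bin
      FROM hr.employees e
    )
    SELECT
      current_rec.*,
      compare_rec.employee_id AS less_salary_employee
    FROM bins current_rec
    LEFT JOIN bins compare_rec
      ON  current_rec.bin <= compare_rec.bin              -- cheap pre-filter: a later hire date can only be in the same or a later bin
      AND current_rec.hire_date < compare_rec.hire_date   -- original range predicate
      AND current_rec.salary < compare_rec.salary
    ORDER BY current_rec.employee_id;

    The bin predicate does not change which rows qualify; it only lets the join skip bin combinations that cannot contain a match, which is the idea behind the binned range join optimization described in the linked article.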
    
  2. A table with 300K records that is self joined results in 300,000 × 300,000 = 90 billion row comparisons. I am astonished that it only took about an hour to perform this amount of work.

    First we compute an estimate by grouping the data, then we use this result to obtain the exact solution.

    To optimize this task it helps to group the table into categories. Do you really need to compare the exact hire date and the exact salary? I would suggest truncating the hire date to the month and rounding the salary to the nearest $1,000. In my case this reduced the table to about 12K rows, which still means well over 100 million comparisons, but this query runs in about a minute.

    #create table Test.worker2 as
    WITH tbl AS (
      -- group workers by salary rounded to $1000 and by hire month, keeping all ids per group
      SELECT
        ROUND(salary / 1000) * 1000 AS salary,
        DATE_TRUNC(hire_date, MONTH) AS hire_month,
        COUNT(1) AS counts,
        ARRAY_AGG(id) AS id
      FROM Test.worker
      GROUP BY 1, 2
      ORDER BY 1, 2
    )
    -- for each group A, collect the ids of every group B with a later hire month and a higher salary
    SELECT
      * EXCEPT (B),
      B.*
    FROM (
      SELECT
        A.*,
        (SELECT STRUCT(COUNT(1) AS LessSalaray_id_counts,
                       ARRAY_CONCAT_AGG(B.id) AS LessSalaray_id)
         FROM tbl B
         WHERE A.hire_month < B.hire_month
           AND A.salary < B.salary) AS B
      FROM tbl A
    )
    

    To obtain the exact result, remove the # in the first line and replace both occurrences of < with <= (otherwise pairs that fall into the same hire month or the same salary bucket would be missed); this exports the data into a new table several GB in size. Then run this query:

    CREATE OR REPLACE TABLE Test.worker3 AS
    SELECT
      a.*,
      B_id
    FROM Test.worker2,
      -- expand each group back into individual worker ids
      UNNEST(id) AS A_id,
      UNNEST(LessSalaray_id) AS B_id
    LEFT JOIN Test.worker a
      ON A_id = a.id
    LEFT JOIN Test.worker b
      ON B_id = b.id
    -- re-check the exact values to drop candidate pairs that only matched at the bucket level
    WHERE a.hire_date < b.hire_date
      AND a.salary < b.salary
    

    This will run for quite a while, but after about 20 minutes you obtain the exact answer.
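
    As a quick plausibility check (a sketch using only the tables and columns produced by the queries above), you can compare the exact pair count against the upper bound implied by the grouped estimate; the exact count should never exceed it:

    SELECT
      -- candidate pairs from the grouped (<=) estimate: group size times number of candidate partner ids
      (SELECT SUM(counts * ARRAY_LENGTH(LessSalaray_id)) FROM Test.worker2) AS estimated_upper_bound,
      -- exact pairs after re-checking the raw hire_date and salary values
      (SELECT COUNT(*) FROM Test.worker3) AS exact_pairs;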
