I am doing a self join on a table with a range condition. Below is the SQL statement.
select
    current_rec.*,
    compare_rec.employee_id as less_salary_employee
from hr.employees current_rec
left join hr.employees compare_rec
    on current_rec.hire_date < compare_rec.hire_date
    and current_rec.salary < compare_rec.salary
order by current_rec.employee_id;
For 300K records, it takes 45-60 minutes. After googling, I learned that binned range join optimization could help, and I read this article on improving query performance:
[https://select.dev/posts/snowflake-range-join-optimization#how-to-optimize-range-joins-in-snowflake][1]. But I am not able to understand it. Can someone explain it, especially the binned range optimization part? A visual representation would help.
2 Answers
1. Create a CTE that bins the employees table into ranges based on hire date.
2. Join the CTE against itself, filtering on current_rec.bin <= compare_rec.bin so that only relevant bins are joined (since current_rec.hire_date must be earlier, its bin can be at most compare_rec's bin).
3. Add the additional filter on salary to complete the range join logic.

This allows the join to run in a more optimized way by avoiding comparisons between rows in bins that cannot possibly match.
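The steps above might look roughly like this. This is only a sketch: the one-year bin width (via `year(hire_date)`) is an assumption, and you would tune it to how your hire dates are distributed:

```sql
-- Sketch: bin each row by hire-date year, then join only bins that can match.
with binned as (
    select
        e.*,
        year(e.hire_date) as bin   -- assumed bin width: one calendar year
    from hr.employees e
)
select
    current_rec.*,
    compare_rec.employee_id as less_salary_employee
from binned current_rec
left join binned compare_rec
    on current_rec.bin <= compare_rec.bin              -- prune irrelevant bins
    and current_rec.hire_date < compare_rec.hire_date  -- original range condition
    and current_rec.salary < compare_rec.salary
order by current_rec.employee_id;
```

The bin predicate does not change the result set; it only gives the optimizer a cheap coarse filter to discard bin pairs before evaluating the exact date and salary comparisons.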
A table with 300K records that is self-joined results in roughly 90 billion comparisons (300,000 x 300,000). I am astonished that it took only an hour to perform that many.
First we compute an estimation by grouping the data, then we use this result to obtain the exact solution.
To optimize this task it might help to group the table into coarser categories. Do you really need to compare the exact hire date and the exact salary? I would suggest truncating the hire date to the month and rounding the salary to the nearest $1000. In my case this reduced the data to 12K records. That still means over 100 million comparisons, but this query runs in about a minute.
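A sketch of that coarsening step, assuming Snowflake's `date_trunc` and `round` with a negative scale; the table name `hr.employees_grouped` is made up for illustration:

```sql
-- Collapse 300K raw rows into coarse (month, salary-band) groups.
create or replace table hr.employees_grouped as
select
    date_trunc('month', hire_date) as hire_month,   -- truncate hiring day to month
    round(salary, -3)              as salary_band,  -- round salary to nearest $1000
    count(*)                       as emp_count
from hr.employees
group by 1, 2;

-- Approximate comparison over ~12K grouped rows instead of 300K raw rows.
select
    a.hire_month,
    a.salary_band,
    sum(b.emp_count) as approx_later_higher_paid
from hr.employees_grouped a
left join hr.employees_grouped b
    on a.hire_month < b.hire_month
    and a.salary_band < b.salary_band
group by a.hire_month, a.salary_band;
```

Because the join now runs over group representatives rather than individual employees, the quadratic cost shrinks with the square of the reduction factor.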
To obtain the exact calculation, remove the # in the first line and replace both occurrences of < with <=, which exports the data into a new table several GB in size. Then run this query; it runs for quite a while, but after about 20 minutes you obtain the exact answer.
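The answerer's actual query is not shown, so the following is only a guess at its shape: the coarse month/band comparisons are relaxed to <= (equal bins can still contain exact matches) and serve as a cheap prune before the exact conditions are applied to the raw rows:

```sql
-- Hypothetical exact step: coarse keys prune candidate pairs, exact predicates decide.
select
    a.employee_id,
    b.employee_id as less_salary_employee
from hr.employees a
join hr.employees b
    on date_trunc('month', a.hire_date) <= date_trunc('month', b.hire_date)
    and round(a.salary, -3) <= round(b.salary, -3)  -- the < replaced by <=
    and a.hire_date < b.hire_date                   -- exact original conditions
    and a.salary    < b.salary;
```

The <= on the coarse keys is essential: with a strict <, pairs of employees hired in the same month (or in the same $1000 salary band) would be silently dropped from the exact result.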