skip to Main Content

I have the following data:

create schema test_schema;
create table test_schema.joinedDf(id long, firstName text, lastName text, age Int, month Int, salary decimal);

insert into test_schema.joinedDf(id, firstName, lastName, age, month, salary)
values(   1111,    "Peter",    "Ivanov", 29,   10,300000.0000000000),
(   1111,    "Peter",    "Ivanov", 29,   12,350000.0000000000),
(   1111,    "Peter",    "Ivanov", 29,   11,350000.0000000000);

When I do this:

select id, firstName, lastName, age, month, salary,
AVG(salary) OVER (PARTITION BY id order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) average_salary
from test_schema.joinedDf;

I get this data:

'1111','Peter','Ivanov','29','10','300000','333333.3333'
'1111','Peter','Ivanov','29','12','350000','333333.3333'
'1111','Peter','Ivanov','29','11','350000','333333.3333'

But when I do:

select id, firstName, lastName, age, month, salary,
AVG(salary) OVER (PARTITION BY id order by salary ) average_salary
from test_schema.joinedDf;

I get:

'1111','Peter','Ivanov','29','10','300000','300000.0000'
'1111','Peter','Ivanov','29','12','350000','333333.3333'
'1111','Peter','Ivanov','29','11','350000','333333.3333'

I read that by default, if you specify order by inside the partition by clause. You get the window frame of unbounded preceding and current row. But why doesn’t the data look like this?:

'1111','Peter','Ivanov','29','10','300000','300000.0000'
'1111','Peter','Ivanov','29','12','350000','325000.0000'
'1111','Peter','Ivanov','29','11','350000','333333.3333'

So, we first have Partition(300000), the average is 300000, then we have Partition(300000, 350000), the average is 325000, then we have Partition(300000, 350000, 350000) the average is 333333.3333. No?

3

Answers


  1. select id, firstName, lastName, age, month, salary,
    AVG(salary) OVER (PARTITION BY id order by salary ) average_salary
    from test_schema.joinedDf;
    

    Because there are duplicate salaries, this query did not work, instead you need to order by a unique column (like rn generated by row_number()) or a date/timestamp if exists :

    WITH CTE AS (
      select id, firstName, lastName, age, month, salary, row_number() over () as rn
      from test_schema.joinedDf
    )
    select id, firstName, lastName, age, month, salary, 
           AVG(salary) OVER (PARTITION BY id order by rn ) average_salary
    from CTE
    

    Result :

    id  firstName   lastName    age month   salary  average_salary
    1111    Peter   Ivanov      29  10      300000  300000.0000
    1111    Peter   Ivanov      29  12      350000  325000.0000
    1111    Peter   Ivanov      29  11      350000  333333.3333
    

    Demo here

    Login or Signup to reply.
  2. The window function AVG is more similar to DENSE_RANK than to ROW_NUMBER. When it finds two (or more) tied values, the running AVG considers them together, that’s the reason why you get your last two averages identical.

    If you want to make the running AVG to take one row at a time, you need to untie the tied records. The only way to do it, is by ordering according to a field that is not tied between the two tied records, in your case "month" only.

    As already mentioned in the comments section, it’s enough to use "month" inside the ORDER BY clause. In order to retain the original nature of your query, you can also use it alongside "salary".

    SELECT id, firstName, lastName, age, month, salary,
           AVG(salary) OVER (PARTITION BY id ORDER BY salary, month) average_salary
    FROM test_schema.joinedDf;
    

    Output:

    id firstName lastName age month salary average_salary
    1111 Peter Ivanov 29 10 300000 300000.0000
    1111 Peter Ivanov 29 11 350000 325000.0000
    1111 Peter Ivanov 29 12 350000 333333.3333

    Check the demo here.

    Login or Signup to reply.
  3. When you do AVG(salary) OVER (PARTITION BY id order by salary ) average_salary you will have 1 partition created by id and 2 partitions inside id partition created by order by clause and RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW will be used.

    1st partition will be salary for month 10

    2nd partition are going to be salaries for months 12 and 11

    CURRENT ROW of your first partition will be row with month 10 and as that is only one average salary will be 30000.

    CURRENT ROW of your last partition will be row with month 11 (last row) and RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW means range between CURRENT ROW and all previous rows created by partitioning by id (not order by) and that is why salary for month 10 would be taken into account and you get 333333.333 average.

    To get result you expected to get you would have to use this query

    SELECT id, firstName, lastName, age, month, salary,
       AVG(salary) OVER (partition by id order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) average_salary
    FROM test_schema.joinedDf;
    

    Here is demo DEMO

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search