I have the following data:
create schema test_schema;
create table test_schema.joinedDf(id long, firstName text, lastName text, age Int, month Int, salary decimal);
insert into test_schema.joinedDf(id, firstName, lastName, age, month, salary)
values( 1111, "Peter", "Ivanov", 29, 10,300000.0000000000),
( 1111, "Peter", "Ivanov", 29, 12,350000.0000000000),
( 1111, "Peter", "Ivanov", 29, 11,350000.0000000000);
When I do this:
select id, firstName, lastName, age, month, salary,
AVG(salary) OVER (PARTITION BY id order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) average_salary
from test_schema.joinedDf;
I get this data:
'1111','Peter','Ivanov','29','10','300000','333333.3333'
'1111','Peter','Ivanov','29','12','350000','333333.3333'
'1111','Peter','Ivanov','29','11','350000','333333.3333'
But when I do:
select id, firstName, lastName, age, month, salary,
AVG(salary) OVER (PARTITION BY id order by salary ) average_salary
from test_schema.joinedDf;
I get:
'1111','Peter','Ivanov','29','10','300000','300000.0000'
'1111','Peter','Ivanov','29','12','350000','333333.3333'
'1111','Peter','Ivanov','29','11','350000','333333.3333'
I read that by default, if you specify order by
inside the partition by
clause. You get the window frame of unbounded preceding and current row
. But why doesn’t the data look like this?:
'1111','Peter','Ivanov','29','10','300000','300000.0000'
'1111','Peter','Ivanov','29','12','350000','325000.0000'
'1111','Peter','Ivanov','29','11','350000','333333.3333'
So, we first have Partition(300000)
, the average is 300000
, then we have Partition(300000, 350000)
, the average is 325000
, then we have Partition(300000, 350000, 350000)
the average is 333333.3333
. No?
3
Answers
Because there are duplicate salaries, this query did not work, instead you need to order by a unique column (like
rn
generated byrow_number()
) or a date/timestamp if exists :Result :
Demo here
The window function
AVG
is more similar toDENSE_RANK
than toROW_NUMBER
. When it finds two (or more) tied values, the running AVG considers them together, that’s the reason why you get your last two averages identical.If you want to make the running AVG to take one row at a time, you need to untie the tied records. The only way to do it, is by ordering according to a field that is not tied between the two tied records, in your case "month" only.
As already mentioned in the comments section, it’s enough to use "month" inside the
ORDER BY
clause. In order to retain the original nature of your query, you can also use it alongside "salary".Output:
Check the demo here.
When you do
AVG(salary) OVER (PARTITION BY id order by salary ) average_salary
you will have 1 partition created by id and 2 partitions inside id partition created by order by clause and RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW will be used.1st partition will be salary for month 10
2nd partition are going to be salaries for months 12 and 11
CURRENT ROW of your first partition will be row with month 10 and as that is only one average salary will be 30000.
CURRENT ROW of your last partition will be row with month 11 (last row) and RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW means range between CURRENT ROW and all previous rows created by partitioning by id (not order by) and that is why salary for month 10 would be taken into account and you get 333333.333 average.
To get result you expected to get you would have to use this query
Here is demo DEMO