I am clear on how GROUPING SETS
(and ROLLUP
and CUBE
) work for a single expression. However, I’ve never entirely understood how it works when multiple GROUPING SETS
are combined. Here is an example table that I’ve created to help with this question:
CREATE TABLE movies AS (
SELECT 'Black Widow' Movie, 'Disney' Studio, 2021 AS Year, 226583885 Revenue UNION ALL
SELECT 'Black Widow', 'Disney', 2022, 126583885 UNION ALL
SELECT 'Black Widow', 'Disney', 2023, 26583885 UNION ALL
SELECT 'Spider-man: No Way Home', 'Sony', 2021, 740615703 UNION ALL
SELECT 'Spider-man: No Way Home', 'Sony', 2022, 640615703 UNION ALL
SELECT 'Spider-man: No Way Home', 'Sony', 2023, 540615703 UNION ALL
SELECT 'Top Gun: Maverick', 'Paramount', 2022, 847848146 UNION ALL
SELECT 'Top Gun: Maverick', 'Paramount', 2023, 647848146 UNION ALL
SELECT 'The Batman', 'Warner Bros.', 2022, 486122791 UNION ALL
SELECT 'The Batman', 'Warner Bros.', 2023, 286122791 UNION ALL
SELECT 'Barbie', 'Warner Bros.', 2023, 1441769400 UNION ALL
SELECT 'Oppenheimer', 'NBCUniversal', 2023, 950205530
)
If I repeat a GROUP BY
element that is not part of a GROUPING SETS|ROLLUP|CUBE
, it seems to have no effect, for example:
select row_number() over () num, studio, sum(revenue) from movies group by studio;
┌──────────────────────┬──────────────┬──────────────┐
│ num ┆ Studio ┆ sum(revenue) │
╞══════════════════════╪══════════════╪══════════════╡
│ 1 ┆ Disney ┆ 379751655 │
│ 2 ┆ Sony ┆ 1921847109 │
│ 3 ┆ Paramount ┆ 1495696292 │
│ 4 ┆ Warner Bros. ┆ 2214014982 │
│ 5 ┆ NBCUniversal ┆ 950205530 │
└──────────────────────┴──────────────┴──────────────┘
Is the same as:
select row_number() over () num, studio, sum(revenue) from movies group by studio, studio, studio;
┌──────────────────────┬──────────────┬──────────────┐
│ num ┆ Studio ┆ sum(revenue) │
╞══════════════════════╪══════════════╪══════════════╡
│ 1 ┆ Disney ┆ 379751655 │
│ 2 ┆ Sony ┆ 1921847109 │
│ 3 ┆ Paramount ┆ 1495696292 │
│ 4 ┆ Warner Bros. ┆ 2214014982 │
│ 5 ┆ NBCUniversal ┆ 950205530 │
└──────────────────────┴──────────────┴──────────────┘
That seems straightforward. However, when I add in a GROUPING SETS
, it does change things when elements are repeated:
select row_number() over () num, studio, sum(revenue) from movies
group by grouping sets(studio, ());
┌─────┬──────────────┬──────────────┐
│ num ┆ Studio ┆ sum(revenue) │
╞═════╪══════════════╪══════════════╡
│ 1 ┆ Disney ┆ 379751655 │
│ 2 ┆ Sony ┆ 1921847109 │
│ 3 ┆ Paramount ┆ 1495696292 │
│ 4 ┆ Warner Bros. ┆ 2214014982 │
│ 5 ┆ NBCUniversal ┆ 950205530 │
│ 6 ┆ ┆ 6961515568 │
└─────┴──────────────┴──────────────┘
Elapsed: 3 ms
select row_number() over () num, studio, sum(revenue) from movies
group by grouping sets(studio, ()), grouping sets(studio, ());
┌─────┬──────────────┬──────────────┐
│ num ┆ Studio ┆ sum(revenue) │
╞═════╪══════════════╪══════════════╡
│ 1 ┆ Disney ┆ 379751655 │
│ 2 ┆ Sony ┆ 1921847109 │
│ 3 ┆ Paramount ┆ 1495696292 │
│ 4 ┆ Warner Bros. ┆ 2214014982 │
│ 5 ┆ NBCUniversal ┆ 950205530 │
│ 6 ┆ Disney ┆ 379751655 │
│ 7 ┆ Sony ┆ 1921847109 │
│ 8 ┆ Paramount ┆ 1495696292 │
│ 9 ┆ Warner Bros. ┆ 2214014982 │
│ 10 ┆ NBCUniversal ┆ 950205530 │
│ 11 ┆ Disney ┆ 379751655 │
│ 12 ┆ Sony ┆ 1921847109 │
│ 13 ┆ Paramount ┆ 1495696292 │
│ 14 ┆ Warner Bros. ┆ 2214014982 │
│ 15 ┆ NBCUniversal ┆ 950205530 │
│ 16 ┆ ┆ 6961515568 │
└─────┴──────────────┴──────────────┘
Elapsed: 2 ms
How then is the GROUPING SETS
for multiple items done? For example, if the equivalent UNION
-ed clause for GROUP BY GROUPING SETS(studio, ())
is:
-- GROUPING SETS(
SELECT studio, SUM(revenue) FROM movies GROUP BY studio UNION ALL -- studio,
SELECT NULL , SUM(revenue) FROM movies -- ()
-- )
Then what would be the equivalent UNION ALL
if there are multiple items in the GROUP BY
?
Note: the above queries have been tested in both Postgres — https://www.db-fiddle.com/f/hvrNtMSrAe7UMz6dTn33Y9/0 — and SQL Server (though I was getting arithmetic overflow so I modified the numbers a bit) https://dbfiddle.uk/LW3fXIIo.
3
Answers
Postgres has a good entry on how multiple elements are specified in a
GROUP BY
clause:Let's do this based on the cross product of the individual items in your example above:
There are two combinations here --
studio
and()
(empty) -- for eachGROUP BY
element. We then have four combinations here:The first item can be simplified from
(studio, studio)
to just(studio)
(sinceGROUP BY a, a
can be simplified toGROUP BY a
as mentioned in the question itself).So now we have:
Here we have three 'duplicate'
GROUP BY
s, and then one where we don'tGROUP BY
anything. And this gives us our original answer: https://www.db-fiddle.com/f/hvrNtMSrAe7UMz6dTn33Y9/1. Postgres also offers theGROUP BY DISTINCT
in case we want to filter out the duplicate entires, meaning doing aGROUP BY DISTINCT GROUPING SETS ((studio), (studio), (studio), ())
would act the same asGROUP BY GROUPING SETS ((studio), ())
.Finally, to translate this into a
UNION ALL
as asked in the question, we would get:It's not a useful query of course (especially with the
UNION ALL
) but it should show you how the cross-product is done here in theGROUP BY
. https://www.db-fiddle.com/f/hvrNtMSrAe7UMz6dTn33Y9/2The first set, you are grouping by
studio
, the second set no groupingFrom the documentation:
So
is equivalent to
Note that you can use
in case you want to remove duplicate grouping sets (possibly useful when dealing with complex rules and/or having multiple rollup/cube in a single group by).