skip to Main Content

I am clear on how GROUPING SETS (and ROLLUP and CUBE) work for a single expression. However, I’ve never entirely understood how it works when multiple GROUPING SETS are combined. Here is an example table that I’ve created to help with this question:

CREATE TABLE movies AS (
    SELECT 'Black Widow' Movie, 'Disney' Studio, 2021 AS Year, 226583885 Revenue UNION ALL
    SELECT 'Black Widow', 'Disney', 2022, 126583885 UNION ALL
    SELECT 'Black Widow', 'Disney', 2023, 26583885 UNION ALL
    SELECT 'Spider-man: No Way Home', 'Sony', 2021, 740615703 UNION ALL
    SELECT 'Spider-man: No Way Home', 'Sony', 2022, 640615703 UNION ALL
    SELECT 'Spider-man: No Way Home', 'Sony', 2023, 540615703 UNION ALL
    SELECT 'Top Gun: Maverick', 'Paramount', 2022, 847848146 UNION ALL
    SELECT 'Top Gun: Maverick', 'Paramount', 2023, 647848146 UNION ALL
    SELECT 'The Batman', 'Warner Bros.', 2022, 486122791 UNION ALL
    SELECT 'The Batman', 'Warner Bros.', 2023, 286122791 UNION ALL
    SELECT 'Barbie', 'Warner Bros.', 2023, 1441769400 UNION ALL
    SELECT 'Oppenheimer', 'NBCUniversal', 2023, 950205530
)

If I repeat a GROUP BY element that is not part of a GROUPING SETS|ROLLUP|CUBE, it seems to have no effect, for example:

select row_number() over () num, studio, sum(revenue) from movies group by studio;
┌──────────────────────┬──────────────┬──────────────┐
│ num                  ┆ Studio       ┆ sum(revenue) │
╞══════════════════════╪══════════════╪══════════════╡
│                    1 ┆ Disney       ┆    379751655 │
│                    2 ┆ Sony         ┆   1921847109 │
│                    3 ┆ Paramount    ┆   1495696292 │
│                    4 ┆ Warner Bros. ┆   2214014982 │
│                    5 ┆ NBCUniversal ┆    950205530 │
└──────────────────────┴──────────────┴──────────────┘

Is the same as:

select row_number() over () num, studio, sum(revenue) from movies group by studio, studio, studio;
┌──────────────────────┬──────────────┬──────────────┐
│ num                  ┆ Studio       ┆ sum(revenue) │
╞══════════════════════╪══════════════╪══════════════╡
│                    1 ┆ Disney       ┆    379751655 │
│                    2 ┆ Sony         ┆   1921847109 │
│                    3 ┆ Paramount    ┆   1495696292 │
│                    4 ┆ Warner Bros. ┆   2214014982 │
│                    5 ┆ NBCUniversal ┆    950205530 │
└──────────────────────┴──────────────┴──────────────┘

That seems straightforward. However, when I add in a GROUPING SETS, it does change things when elements are repeated:

select row_number() over () num, studio, sum(revenue) from movies 
group by grouping sets(studio, ());
┌─────┬──────────────┬──────────────┐
│ num ┆ Studio       ┆ sum(revenue) │
╞═════╪══════════════╪══════════════╡
│   1 ┆ Disney       ┆    379751655 │
│   2 ┆ Sony         ┆   1921847109 │
│   3 ┆ Paramount    ┆   1495696292 │
│   4 ┆ Warner Bros. ┆   2214014982 │
│   5 ┆ NBCUniversal ┆    950205530 │
│   6 ┆              ┆   6961515568 │
└─────┴──────────────┴──────────────┘
Elapsed: 3 ms

select row_number() over () num, studio, sum(revenue) from movies 
group by grouping sets(studio, ()), grouping sets(studio, ());
┌─────┬──────────────┬──────────────┐
│ num ┆ Studio       ┆ sum(revenue) │
╞═════╪══════════════╪══════════════╡
│   1 ┆ Disney       ┆    379751655 │
│   2 ┆ Sony         ┆   1921847109 │
│   3 ┆ Paramount    ┆   1495696292 │
│   4 ┆ Warner Bros. ┆   2214014982 │
│   5 ┆ NBCUniversal ┆    950205530 │
│   6 ┆ Disney       ┆    379751655 │
│   7 ┆ Sony         ┆   1921847109 │
│   8 ┆ Paramount    ┆   1495696292 │
│   9 ┆ Warner Bros. ┆   2214014982 │
│  10 ┆ NBCUniversal ┆    950205530 │
│  11 ┆ Disney       ┆    379751655 │
│  12 ┆ Sony         ┆   1921847109 │
│  13 ┆ Paramount    ┆   1495696292 │
│  14 ┆ Warner Bros. ┆   2214014982 │
│  15 ┆ NBCUniversal ┆    950205530 │
│  16 ┆              ┆   6961515568 │
└─────┴──────────────┴──────────────┘
Elapsed: 2 ms

How then is the GROUPING SETS for multiple items done? For example, if the equivalent UNION-ed clause for GROUP BY GROUPING SETS(studio, ()) is:

                                                                        -- GROUPING SETS(
SELECT studio, SUM(revenue) FROM movies GROUP BY studio UNION ALL       --   studio,
SELECT NULL  , SUM(revenue) FROM movies                                 --   ()
                                                                        -- )

Then what would be the equivalent UNION ALL if there are multiple items in the GROUP BY ?


Note: the above queries have been tested in both Postgres — https://www.db-fiddle.com/f/hvrNtMSrAe7UMz6dTn33Y9/0 — and SQL Server (though I was getting arithmetic overflow so I modified the numbers a bit) https://dbfiddle.uk/LW3fXIIo.

3

Answers


  1. Chosen as BEST ANSWER

    Postgres has a good entry on how multiple elements are specified in a GROUP BY clause:

    If multiple grouping items are specified in a single GROUP BY clause, then the final list of grouping sets is the cross product of the individual items. For example:

    Let's do this based on the cross product of the individual items in your example above:

    GROUP BY GROUPING SETS(studio, ()), GROUP BY GROUPING SETS(studio, ())
    

    There are two combinations here -- studio and () (empty) -- for each GROUP BY element. We then have four combinations here:

    GROUP BY GROUPING SETS (
        (studio, studio),
        (studio),
        (studio),
        ()
    )
    

    The first item can be simplified from (studio, studio) to just (studio) (since GROUP BY a, a can be simplified to GROUP BY a as mentioned in the question itself).

    So now we have:

    GROUP BY GROUPING SETS (
        (studio),
        (studio),
        (studio),
        ()
    )
    

    Here we have three 'duplicate' GROUP BYs, and then one where we don't GROUP BY anything. And this gives us our original answer: https://www.db-fiddle.com/f/hvrNtMSrAe7UMz6dTn33Y9/1. Postgres also offers the GROUP BY DISTINCT in case we want to filter out the duplicate entires, meaning doing a GROUP BY DISTINCT GROUPING SETS ((studio), (studio), (studio), ()) would act the same as GROUP BY GROUPING SETS ((studio), ()).

    Finally, to translate this into a UNION ALL as asked in the question, we would get:

    select studio, sum(revenue) from movies group by studio union all
    select studio, sum(revenue) from movies group by studio union all
    select studio, sum(revenue) from movies group by studio union all
    select NULL, sum(revenue) from movies
    

    It's not a useful query of course (especially with the UNION ALL) but it should show you how the cross-product is done here in the GROUP BY. https://www.db-fiddle.com/f/hvrNtMSrAe7UMz6dTn33Y9/2

    enter image description here


  2. The first set, you are grouping by studio, the second set no grouping

    select row_number() over () num, studio, sum(revenue) 
    from movies 
    group by grouping sets(
      (studio)
      , ()
    );
    
    Login or Signup to reply.
  3. From the documentation:

    If multiple grouping items are specified in a single GROUP BY clause, then the final list of grouping sets is the cross product of the individual items.

    So

    group by grouping sets(studio, ()), grouping sets(studio, ())
    

    is equivalent to

    group by grouping sets (
      (studio), -- (studio x studio, simplifies to just studio)
      (studio), -- (studio x ())
      (studio), -- (() x studio)
      (),       -- (() x ())
    )
    

    Note that you can use

    group by distinct grouping sets(...)
    

    in case you want to remove duplicate grouping sets (possibly useful when dealing with complex rules and/or having multiple rollup/cube in a single group by).

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search