Postgresql - Add an aggregate over full table outside the buckets to every bucket row

Dada
November 23, 2023
165 views
0 votes
2 Answers

I have a query that filters the emails of people who "have a quantity" bigger than the median over all quantities found in the table:

SELECT
    AVG(quantity) AS avg_quantity,
    email   
FROM table 
GROUP BY email
HAVING AVG(quantity) > (SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY quantity) FROM table)

The result is:

123.12    [email protected]
0.5       [email protected]

In addition to this, I would like to add a column which will have for all rows the same value, that of the median calculated in the HAVING clause above. If we suppose this median to be equal to 0.01, then I would like to have the following result:

123.12    [email protected]    0.01
0.5       [email protected]    0.01

I tried to doing a Cartesian product:

WITH (
    SELECT
        AVG(quantity) AS avg_quantity,
        email   
    FROM table 
    GROUP BY email
    HAVING AVG(quantity) > (SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY quantity) FROM table)
) AS tmp
SELECT tmp.*, PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY table.quantity)  from tmp, table

But I get this error:

ERROR: column "tmp. avg_quantity" must appear in the GROUP BY clause or be used in an aggregate function

How can I achieve the above expected output with 3 columns?

Answers

You can use it as an uncorrelated scalar subquery right in the select list. Demo at db<>fiddle:

SELECT
    AVG(quantity) AS avg_quantity,
    email,
    (SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY quantity) FROM "table")
FROM "table" 
GROUP BY email
HAVING AVG(quantity) > (SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY quantity) FROM "table");

You’re already using the query this way, just elsewhere. You can actually re-use it for both with the CTE you already tried out:

with _(median) as materialized (
  SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY quantity) FROM "table")
SELECT AVG(quantity) AS avg_quantity,
       email,
       (SELECT median from _)
FROM "table" 
GROUP BY email
HAVING AVG(quantity) > (SELECT median from _);

Your idea with a Cartesian product would work too. You’re computing one median for the whole table, so the same single result gets glued to every row. It has to be grouped by, but being identical throughout, it does not affect the current grouping.

SELECT AVG(quantity) AS avg_quantity,
       email,
       median
FROM "table"
,(SELECT PERCENTILE_DISC(0.5)WITHIN GROUP(ORDER BY quantity) FROM "table"
 ) AS _(median)
GROUP BY email,median
HAVING AVG(quantity) > median;

- ErwinBrandstetter
- November 23, 2023 at 3:04 pm
- 0 votes
0
Assuming the two distinct percentiles (0.5 <> 0.9) in your question are not just typos.

To compute multiple percentiles, consider the other variant of percentile_disc() taking an array of fractions, and returning a corresponding array of values. Use it a CTE once, and reference the result twice.
```
WITH p AS (
   SELECT percentile_disc('{.5, .9}'::float8[]) WITHIN GROUP (ORDER BY quantity) AS pctl
   FROM   tbl
   )
SELECT email, round(avg(quantity), 2) AS avg_quantity, (SELECT pctl[2] FROM p) AS pctl_90
FROM   tbl
GROUP  BY email
HAVING avg(quantity) > (SELECT pctl[1] FROM p);
```
fiddle

This way you save an additional scan over the table.
But use a CTE even for a single percentile that’s subsequently referenced in two different places.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Add an aggregate over full table outside the buckets to every bucket row

Answers