How to use two aggregate functions (for example double string_agg
or just sum
), but ensure that the results do not contain duplicates caused by the other aggregate function (caused by second join)? I use PostgreSQL.
Example
I have three tables:
create table boxes
(
id bigserial primary key,
name varchar(255)
);
create table animals
(
id bigserial primary key,
name varchar(255),
age numeric,
box_id bigint constraint animals_boxes_id references boxes
);
create table vegetables
(
id bigserial primary key,
name varchar(255),
weight numeric,
box_id bigint constraint vegatables_box_id references boxes
);
Some input data:
insert into boxes (name) values ('First box');
insert into animals (box_id, name, age) values (1, 'Cat', 2);
insert into animals (box_id, name, age) values (1, 'Cat', 3);
insert into animals (box_id, name, age) values (1, 'Dog', 5);
insert into vegetables (box_id, name, weight) values (1, 'Tomato', 20);
insert into vegetables (box_id, name, weight) values (1, 'Cucumber', 30);
insert into vegetables (box_id, name, weight) values (1, 'Potato', 50);
And I want to get animal names in boxes:
select b.name as box_name,
string_agg(a.name, ', ' order by a.id) as animal_names
from boxes as b
left join animals a on b.id = a.box_id
group by b.name;
It works:
box_name | animal_names |
---|---|
First box | Cat, Cat, Dog |
But I also want to get vegetable names. But it doesn’t work:
select b.name as box_name,
string_agg(a.name, ', ' order by a.id) as animal_names,
string_agg(v.name, ', ' order by v.id) as vegatable_names
from boxes as b
left join animals a on b.id = a.box_id
left join vegetables v on b.id = v.box_id
group by b.name;
It produces duplicates in animal names and vegetable names:
box_name | animal_names | vegatable_names |
---|---|---|
First box | Cat, Cat, Cat, Cat, Cat, Cat, Dog, Dog, Dog | Tomato, Tomato, Tomato, Cucumber, Cucumber, Cucumber, Potato, Potato, Potato |
Result should be:
box_name | animal_names | vegatable_names |
---|---|---|
First box | Cat, Cat, Dog | Tomato, Cucumber, Potato |
I can’t just add distinct
to remove duplicates because:
- Names in tables can duplicates (two animals with name
Cat
). If I usedistinct
it producesCat, Dog
instead ofCat, Cat, Dog
. - I use
order by
instring_agg
(it causeERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list
when addeddistinct
). Even if I removeorder by
(string_agg(distinct a.name, ', ')
) I can’t use it because first point.
Additional info
It applies to all aggregates function: string_agg
, array_agg
, json_object_agg
and even sum
.
Sum age of animals:
select sum(a.age)
from boxes as b
left join animals a on b.id = a.box_id
-- left join vegetables v on b.id = v.box_id
group by b.name;
Without second join it calculates correctly (10
) but with it wrong (30
) – due to duplicates.
2
Answers
try to join the
vegetables
table to theanimals
tableoutput:
fiddle here
The basic problem is explained here:
For a small selection,
LATERAL
subqueries are typically fastest:While aggregating the whole table, this is faster:
fiddle
Note how I sort in a subquery, which is typically faster than doing a per-aggregate sort. An optional optimization.
Aside: about
varchar(255)
in your test setup: