skip to Main Content

How to use two aggregate functions (for example double string_agg or just sum), but ensure that the results do not contain duplicates caused by the other aggregate function (caused by second join)? I use PostgreSQL.

Example

I have three tables:

create table boxes
(
    id   bigserial primary key,
    name varchar(255)
);

create table animals
(
    id     bigserial primary key,
    name   varchar(255),
    age    numeric,
    box_id bigint constraint animals_boxes_id references boxes
);

create table vegetables
(
    id     bigserial primary key,
    name   varchar(255),
    weight numeric,
    box_id bigint constraint vegatables_box_id references boxes
);

Some input data:

insert into boxes (name) values ('First box');
insert into animals (box_id, name, age) values (1, 'Cat', 2);
insert into animals (box_id, name, age) values (1, 'Cat', 3);
insert into animals (box_id, name, age) values (1, 'Dog', 5);
insert into vegetables (box_id, name, weight) values (1, 'Tomato', 20);
insert into vegetables (box_id, name, weight) values (1, 'Cucumber', 30);
insert into vegetables (box_id, name, weight) values (1, 'Potato', 50);

And I want to get animal names in boxes:

select b.name                                 as box_name,
       string_agg(a.name, ', ' order by a.id) as animal_names
from boxes as b
         left join animals a on b.id = a.box_id
group by b.name;

It works:

box_name animal_names
First box Cat, Cat, Dog

But I also want to get vegetable names. But it doesn’t work:

select b.name                                 as box_name,
       string_agg(a.name, ', ' order by a.id) as animal_names,
       string_agg(v.name, ', ' order by v.id) as vegatable_names
from boxes as b
         left join animals a on b.id = a.box_id
         left join vegetables v on b.id = v.box_id
group by b.name;

It produces duplicates in animal names and vegetable names:

box_name animal_names vegatable_names
First box Cat, Cat, Cat, Cat, Cat, Cat, Dog, Dog, Dog Tomato, Tomato, Tomato, Cucumber, Cucumber, Cucumber, Potato, Potato, Potato

Result should be:

box_name animal_names vegatable_names
First box Cat, Cat, Dog Tomato, Cucumber, Potato

I can’t just add distinct to remove duplicates because:

  • Names in tables can duplicates (two animals with name Cat). If I use distinct it produces Cat, Dog instead of Cat, Cat, Dog.
  • I use order by in string_agg (it cause ERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list when added distinct). Even if I remove order by (string_agg(distinct a.name, ', ')) I can’t use it because first point.

Additional info

It applies to all aggregates function: string_agg, array_agg, json_object_agg and even sum.

Sum age of animals:

select sum(a.age)
from boxes as b
         left join animals a on b.id = a.box_id
         -- left join vegetables v on b.id = v.box_id
group by b.name;

Without second join it calculates correctly (10) but with it wrong (30) – due to duplicates.

2

Answers


  1. try to join the vegetables table to the animals table

    SELECT b.name AS box_name
        ,string_agg(a.name, ', ' ORDER BY a.id) AS animal_names
        ,string_agg(v.name, ', ' ORDER BY v.id) AS vegetable_names
    FROM boxes AS b
    LEFT JOIN animals a ON b.id = a.box_id
    LEFT JOIN vegetables v ON v.id = a.id
    GROUP BY b.name;
    

    output:

    box_name animal_names vegetable_names
    First box Cat, Cat, Dog Tomato, Cucumber, Potato

    fiddle here

    Login or Signup to reply.
  2. The basic problem is explained here:

    For a small selection, LATERAL subqueries are typically fastest:

    SELECT b.name AS box_name, a.*, v.*
    FROM   boxes b
    LEFT  JOIN LATERAL (
       SELECT string_agg(a.name, ', ' ORDER BY a.id) AS animal_names
       FROM   animals a
       WHERE  a.box_id = b.id
       ) a ON true
    LEFT  JOIN LATERAL (
       SELECT string_agg(v.name, ', ' ORDER BY v.id) AS vegatable_names
       FROM   vegetables v
       WHERE  v.box_id = b.id
       ) v ON true;
    

    While aggregating the whole table, this is faster:

    SELECT b.name AS box_name, a.animal_names, v.vegatable_names
    FROM   boxes b
    LEFT   JOIN (
       SELECT box_id, string_agg(a.name, ', ') AS animal_names   
       FROM  (
          SELECT box_id, id, name
          FROM   animals a
          ORDER  BY box_id, id
          ) a
       GROUP  BY 1
       ) a ON a.box_id = b.id
    LEFT   JOIN (
       SELECT box_id, string_agg(v.name, ', ') AS vegatable_names 
       FROM  (
          SELECT box_id, id, name
          FROM   vegetables v
          ORDER  BY box_id, id
          ) v
       GROUP  BY 1   
       ) v ON v.box_id = b.id;
    

    fiddle

    Note how I sort in a subquery, which is typically faster than doing a per-aggregate sort. An optional optimization.

    Aside: about varchar(255) in your test setup:

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search