skip to Main Content

How can I make an SQL query to select records that do not have at least one child element?

I have 3 tables: article (~40K rows), calendar (~450K rows) and calendar_cost (~500K rows).

It is necessary to select such entries of the article table:

  1. there are no entries in the calendar table,
  2. if there are entries in the calendar table, then all of them should not have any entries in the calendar_cost table.
create table article (
    id int PRIMARY KEY,
    name varchar
);

create table calendar (
    id int PRIMARY KEY,
    article_id int REFERENCES article (id) ON DELETE CASCADE,
    number varchar
);

create table calendar_cost (
    id int PRIMARY KEY,
    calendar_id int REFERENCES calendar (id) ON DELETE CASCADE,
    cost_value numeric
);

insert into article (id, name) values
(1, 'Article 1'),
(2, 'Article 2'),
(3, 'Article 3');

insert into calendar (id, article_id, number) values
(101, 1, 'Point 1-1'),
(102, 1, 'Point 1-2'),
(103, 2, 'Point 2');

insert into calendar_cost (id, calendar_id, cost_value) values
(400, 101, 100.123),
(401, 101, 400.567);

As a result, "Article 2" (condition 2) and "Article 3" (condition 1) will suit us.

My SQL query is very slow (the second condition part), how can I do it optimally? Is it possible to do without "union all" operator?

-- First condition
select a.id from article a 
left join calendar c on a.id = c.article_id
where c.id is null

union all 

-- Second condition
select a.id from article a
where id not in(
    select aa.id from article aa
    join calendar c on aa.id = c.article_id
    join calendar_cost cost on c.id = cost.calendar_id
    where aa.id = a.id limit 1
)

UPDATE

This is how you can fill my tables with random data for about the same amount of data. The @Bohemian query is very fast, and the rest are very slow. But as soon as I applied 2 indexes, as @nik advised, all queries began to be executed very, very quickly!

do $$
declare
    article_id int;     
    calendar_id bigint;
    i int; j int;
begin
    create table article (
        id int PRIMARY KEY,
        name varchar
    );
    
    create table calendar (
        id serial PRIMARY KEY,
        article_id int REFERENCES article (id) ON DELETE CASCADE,
        number varchar      
    );
    create INDEX ON calendar(article_id);

    create table calendar_cost (
        id serial PRIMARY KEY,
        calendar_id bigint REFERENCES calendar (id) ON DELETE CASCADE,
        cost_value numeric      
    );
    create INDEX ON calendar_cost(calendar_id);

    for article_id in 1..45000 loop
        insert into article (id, name) values (article_id, 'Article ' || article_id);
        for i in 0..floor(random() * 25) loop
            insert into calendar (article_id, number) values (article_id, 'Number ' || article_id || '-' || i) returning id into calendar_id;
            for j in 0..floor(random() * 2) loop
                insert into calendar_cost (calendar_id, cost_value) values (calendar_id, round((random() * 100)::numeric, 3));  
            end loop;
        end loop;
    end loop;
end $$;
@Bohemian
Planning Time: 0.405 ms
Execution Time: 1196.082 ms

@nbk
Planning Time: 0.702 ms
Execution Time: 165.129 ms

@Chris Maurer
Planning Time: 0.803 ms
Execution Time: 800.000 ms

@Stu
Planning Time: 0.446 ms
Execution Time: 280.842 ms

So which query to choose now as the right one is a matter of taste.

4

Answers


  1. No need to split the conditions: The only condition you need to check for is that there are no calendar_cost rows whatsoever, which is the case if there are no calendar rows.

    The trick is to use outer joins, which still return the parent table but have all null values when there is no join. Further, count() does not count null values, so requiring that the count of calendar_cost is zero is all you need.

    select a.id
    from article a
    left join calendar c on c.article_id = a.id
    left join calendar_cost cost on cost.calendar_id = c.id
    group by a.id
    having count(cost.calendar_id) = 0
    

    See live demo.

    If there are indexes on the id columns (the usual case), this query will perform quite well given the small table sizes.

    Login or Signup to reply.
  2. Your second condition should start just like your first one: find all the calendar entries without calendar cost and only afterwards join it to article.

    select a.id
    from article a
      Inner Join (
        Select article_id
        From calendar c left join calendar_cost cc
          On c.id=cc.calendar_id
        Where cc.calendar_id is null
    ) cnone
      On a.id = cnone.article_id
    

    This approach is based on the thought that calendar entries without calendar_cost is relatively rare compared to all the calendar entries.

    Login or Signup to reply.
  3. Your query is not valid as IN clauses don’t support LIMIT

    Adding some indexes on article_id and calender_id

    Will help the performance

    As you can see in the query plan

    create table article (
        id int PRIMARY KEY,
        name varchar(100)
    );
    
    create table calendar (
        id int PRIMARY KEY,
        article_id int REFERENCES article (id) ON DELETE CASCADE,
        number varchar(100)
      ,index(article_id)
    );
    
    create table calendar_cost (
        id int PRIMARY KEY,
        calendar_id int REFERENCES calendar (id) ON DELETE CASCADE,
        cost_value numeric
      ,INDEX(calendar_id)
    );
    
    insert into article (id, name) values
    (1, 'Article 1'),
    (2, 'Article 2'),
    (3, 'Article 3');
    
    insert into calendar (id, article_id, number) values
    (101, 1, 'Point 1-1'),
    (102, 1, 'Point 1-2'),
    (103, 2, 'Point 2');
    
    insert into calendar_cost (id, calendar_id, cost_value) values
    (400, 101, 100.123),
    (401, 101, 400.567);
    
    
    Records: 3  Duplicates: 0  Warnings: 0
    
    Records: 3  Duplicates: 0  Warnings: 0
    
    Records: 2  Duplicates: 0  Warnings: 2
    
    select a.id from article a 
    left join calendar c on a.id = c.article_id
    where c.id is null
    
    
    id
    3
    -- First condition
    EXPLAIN
      select a.id from article a 
    left join calendar c on a.id = c.article_id
    where c.id is null
    
    union all 
    
    -- Second condition
    select a.id from article a
    JOIN (
        select aa.id from article aa
        join calendar c on aa.id = c.article_id
        join calendar_cost cost on c.id = cost.calendar_id
    
      LIMIT 1
    
    ) t1 ON t1.id <> a.id
    
    
    id select_type table partitions type possible_keys key key_len ref rows filtered Extra
    1 PRIMARY a null index null PRIMARY 4 null 3 100.00 Using index
    1 PRIMARY c null ref article_id article_id 5 fiddle.a.id 3 33.33 Using where; Not exists; Using index
    2 UNION <derived3> null system null null null null 1 100.00 null
    2 UNION a null index null PRIMARY 4 null 3 66.67 Using where; Using index
    3 DERIVED cost null index calendar_id calendar_id 5 null 2 100.00 Using where; Using index
    3 DERIVED c null eq_ref PRIMARY,article_id PRIMARY 4 fiddle.cost.calendar_id 1 100.00 Using where
    3 DERIVED aa null eq_ref PRIMARY PRIMARY 4 fiddle.c.article_id 1 100.00 Using index

    fiddle

    Login or Signup to reply.
  4. Try the following using a combination of exists criteria.

    Usually, with supporting indexes, this is more performant than simply joining tables as it offers a short-circuit to get out as soon as a match is found, where as joining typically filters after all rows are joined.

    select a.id 
    from article a 
    where not exists (
      select * from calendar c 
      where c.article_id =  a.id
    )
    or (exists (
      select * from calendar c 
      where c.article_id = a.id
    )
      and not exists (
        select * from calendar_cost cc
        where cc.calendar_id in (select id from calendar c where c.article_id = a.id)
      )
    );
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search