skip to Main Content

The database schema below represents a fictional shows database called hbtn_0d_tvshows tables.

tv_shows
id
title
tv_genres
id
name
tv_show_genres
id
show_id
genre_id

The tv_show_genres is a pivot table that links tv_shows and tv_genres tables. Now I have been given a task to retrieve rows without the comedy genre in them. A friend came up with the below query and it works.

SELECT DISTINCT `title`
  FROM `tv_shows` AS t
       LEFT JOIN `tv_show_genres` AS s
       ON s.`show_id` = t.`id`

       LEFT JOIN `tv_genres` AS g
       ON g.`id` = s.`genre_id`
       WHERE t.`title` NOT IN
             (SELECT `title`
                FROM `tv_shows` AS t
                 INNER JOIN `tv_show_genres` AS s
             ON s.`show_id` = t.`id`

             INNER JOIN `tv_genres` AS g
             ON g.`id` = s.`genre_id`
             WHERE g.`name` = "Comedy")
 ORDER BY `title`;

My question is, why do we need the subquery when we can put a where clause like this where g.name != "Comedy" in the outer query. And since all tables are already loaded by the joins, the filter should work. Like below SQL:

SELECT DISTINCT `title`
      FROM `tv_shows` AS t
           LEFT JOIN `tv_show_genres` AS s
           ON s.`show_id` = t.`id`

           LEFT JOIN `tv_genres` AS g
           ON g.`id` = s.`genre_id`
           WHERE g.`name` != "Comedy";

May be I am missing something, but this is a cause of confusion for me and a stumble in my SQL learning.

Sample database with data

Expected results

title
Better Call Saul
Breaking Bad
Dexter
Game of Thrones
Homeland
House

3

Answers


  1. I prepared some example: https://dbfiddle.uk/mb1_Vlc0

    I put additional column in the tv_shows table with name channel.
    So, now we have two TV-shows with the same name "Funny title" for the different channels: NBC and NBC2.
    And on the NBC this is Comedy, but on the NBC2 this is Detective.
    Your first example excludes title "Funny title" at all, but your second query will leave this name for one of the channels.

    Now you need to make desigion, if you need logic from the first of the sercond query.

    Data model:

    insert into tv_shows values
    (1, 'First title', 'NBC'),
    (2, 'Second title', 'SomeTVShowChannel'),
    (3, 'Some title', 'NBC'),
    (4, 'Funny title', 'NBC'),
    (5, 'Funny title', 'NBC2');
    insert into tv_genres values
    (1, 'Comedy'),
    (2, 'Detective');
    insert into tv_show_genres values
    (1, 1, 1),
    (2, 2, 2),
    (3, 3, 2),
    (4, 4, 1),
    (5, 5, 2);
    

    Screenshot from the my link just for demonstration.
    enter image description here

    Login or Signup to reply.
  2. With your test data and query

    select *
    from tv_shows as s
    left join tv_show_genres as sg on sg.show_id=s.id
    left join tv_genres g on g.id=sg.genre_id
    

    output is

    id title show_id genre_id id name
    1 House 1 1 1 Drama
    1 House 1 2 2 Mystery
    2 Game of Thrones 2 3 3 Adventure
    2 Game of Thrones 2 1 1 Drama
    2 Game of Thrones 2 4 4 Fantasy
    3 The Big Bang Theory 3 5 5 Comedy
    4 New Girl 4 5 5 Comedy
    5 Silicon Valley 5 5 5 Comedy
    6 Breaking Bad 6 6 6 Crime
    6 Breaking Bad 6 1 1 Drama
    6 Breaking Bad 6 7 7 Suspense
    6 Breaking Bad 6 8 8 Thriller
    7 Better Call Saul null null null null
    8 Dexter 8 6 6 Crime
    8 Dexter 8 1 1 Drama
    8 Dexter 8 2 2 Mystery
    8 Dexter 8 7 7 Suspense
    8 Dexter 8 8 8 Thriller
    9 Homeland null null null null
    10 The Last Man on Earth 10 5 5 Comedy
    10 The Last Man on Earth 10 1 1 Drama

    A) As commented by @nbk , some titles, that have multiple genres, would not be eliminated by condition WHERE g.`name` != "Comedy"

    B) If genre_id for some title is not assigned, condition WHERE g.`name` != "Comedy" is equal to condition WHERE g.`name` != null
    So:
    "Comedy"=null – is false
    "Comedy"!=null – is false
    or
    g.name = null – is false
    g.name != null – is false

    and this condition not work as desired.

    C) Your first query may be better write like this

    SELECT title
    FROM `tv_shows` AS t
    WHERE id not in
        (
         SELECT show_id
         FROM `tv_show_genres` AS s
         INNER JOIN `tv_genres` AS g
               ON s.`genre_id` = g.`id`
         WHERE g.`name` = "Comedy"
       )
     ORDER BY `title`;
    

    to avoid cases when there are several rows with the same Title
    and this query has better performance.

    Demo here

    Login or Signup to reply.
  3. Let’s start with your second query, which does not provide the desired result:

    SELECT DISTINCT `title`
    FROM `tv_shows` AS t
    LEFT JOIN `tv_show_genres` AS s
        ON s.`show_id` = t.`id`
    LEFT JOIN `tv_genres` AS g
        ON g.`id` = s.`genre_id`
    WHERE g.`name` != "Comedy";
    

    Firstly, don’t use "Comedy" (double quotes) for string literals. It works (with default MySQL config) but is non-standard and should be avoided. Use 'Comedy' (single quotes) instead.

    Secondly, because your WHERE clause is applied to the rightmost table (tv_genres.name), your LEFT joins are implicitly changed to INNER joins.

    As already explained by nbk in the comments, your WHERE clause is just including all rows where g.name != 'Comedy', which is not the same as "TV shows that do not have the comedy genre".

    Now, let’s jump back to your first query:

    
    SELECT DISTINCT `title`
      FROM `tv_shows` AS t
           LEFT JOIN `tv_show_genres` AS s
           ON s.`show_id` = t.`id`
    
           LEFT JOIN `tv_genres` AS g
           ON g.`id` = s.`genre_id`
           WHERE t.`title` NOT IN
                 (SELECT `title`
                    FROM `tv_shows` AS t
                     INNER JOIN `tv_show_genres` AS s
                 ON s.`show_id` = t.`id`
    
                 INNER JOIN `tv_genres` AS g
                 ON g.`id` = s.`genre_id`
                 WHERE g.`name` = "Comedy"'Comedy')
     ORDER BY `title`;
    

    The two LEFT joins in the outer query are doing nothing, so we can get rid of them. This also means DISTINCT is no longer needed. Instead of joining (for t.`title` NOT IN) on the title, we should be using the id, which has the added bonus of not needing to join tv_shows in the subquery:

    SELECT `title`
    FROM `tv_shows` AS t
    WHERE t.`id` NOT IN (
        SELECT `show_id`
        FROM `tv_show_genres` AS s
        INNER JOIN `tv_genres` AS g
            ON g.`id` = s.`genre_id`
        WHERE g.`name` = 'Comedy'
    )
     ORDER BY `title`;
    

    This can be rewritten as a LEFT JOIN ... IS NULL:

    SELECT `title`
    FROM `tv_shows` AS t
    LEFT JOIN `tv_show_genres` AS s
        ON s.`show_id` = t.`id`
        AND s.`genre_id` = (SELECT id FROM `tv_genres` WHERE `name` = 'Comedy')
    WHERE s.`genre_id` IS NULL
    ORDER BY `title`;
    

    or NOT EXISTS:

    SELECT *
    FROM tv_shows s
    WHERE NOT EXISTS (
        SELECT 1
        FROM tv_show_genres sg
        JOIN tv_genres g ON sg.genre_id = g.id
        WHERE sg.show_id = s.id
        AND g.name = 'Comedy'
    )
    ORDER BY title;
    

    Here’s a db<>fiddle.

    Note that I have changed the indexing of tv_show_genres in the db<>fiddle. With few exceptions, pivot tables need composite indices in both directions – PK(show_id, genre_id) and secondary (genre_id, show_id).

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search