The database schema below represents a fictional shows database called hbtn_0d_tvshows
tables.
tv_shows |
---|
id |
title |
tv_genres |
---|
id |
name |
tv_show_genres |
---|
id |
show_id |
genre_id |
The tv_show_genres
is a pivot table that links tv_shows and tv_genres tables. Now I have been given a task to retrieve rows without the comedy genre in them. A friend came up with the below query and it works.
SELECT DISTINCT `title`
FROM `tv_shows` AS t
LEFT JOIN `tv_show_genres` AS s
ON s.`show_id` = t.`id`
LEFT JOIN `tv_genres` AS g
ON g.`id` = s.`genre_id`
WHERE t.`title` NOT IN
(SELECT `title`
FROM `tv_shows` AS t
INNER JOIN `tv_show_genres` AS s
ON s.`show_id` = t.`id`
INNER JOIN `tv_genres` AS g
ON g.`id` = s.`genre_id`
WHERE g.`name` = "Comedy")
ORDER BY `title`;
My question is, why do we need the subquery when we can put a where clause like this where g.name != "Comedy"
in the outer query. And since all tables are already loaded by the joins, the filter should work. Like below SQL:
SELECT DISTINCT `title`
FROM `tv_shows` AS t
LEFT JOIN `tv_show_genres` AS s
ON s.`show_id` = t.`id`
LEFT JOIN `tv_genres` AS g
ON g.`id` = s.`genre_id`
WHERE g.`name` != "Comedy";
May be I am missing something, but this is a cause of confusion for me and a stumble in my SQL learning.
Expected results
title
Better Call Saul
Breaking Bad
Dexter
Game of Thrones
Homeland
House
3
Answers
I prepared some example: https://dbfiddle.uk/mb1_Vlc0
I put additional column in the
tv_shows
table with namechannel
.So, now we have two TV-shows with the same name "Funny title" for the different channels: NBC and NBC2.
And on the NBC this is Comedy, but on the NBC2 this is Detective.
Your first example excludes title "Funny title" at all, but your second query will leave this name for one of the channels.
Now you need to make desigion, if you need logic from the first of the sercond query.
Data model:
Screenshot from the my link just for demonstration.
With your test data and query
output is
A) As commented by @nbk , some titles, that have multiple genres, would not be eliminated by condition
WHERE g.`name` != "Comedy"
B) If
genre_id
for sometitle
is not assigned, conditionWHERE g.`name` != "Comedy"
is equal to conditionWHERE g.`name` != null
So:
"Comedy"=null – is false
"Comedy"!=null – is false
or
g.name = null – is false
g.name != null – is false
and this condition not work as desired.
C) Your first query may be better write like this
to avoid cases when there are several rows with the same
Title
and this query has better performance.
Demo here
Let’s start with your second query, which does not provide the desired result:
Firstly, don’t use
"Comedy"
(double quotes) for string literals. It works (with default MySQL config) but is non-standard and should be avoided. Use'Comedy'
(single quotes) instead.Secondly, because your
WHERE
clause is applied to the rightmost table (tv_genres.name
), yourLEFT
joins are implicitly changed toINNER
joins.As already explained by nbk in the comments, your
WHERE
clause is just including all rows whereg.name != 'Comedy'
, which is not the same as "TV shows that do not have the comedy genre".Now, let’s jump back to your first query:
The two
LEFT
joins in the outer query are doing nothing, so we can get rid of them. This also meansDISTINCT
is no longer needed. Instead of joining (fort.`title` NOT IN
) on thetitle
, we should be using theid
, which has the added bonus of not needing to jointv_shows
in the subquery:This can be rewritten as a
LEFT JOIN ... IS NULL
:or
NOT EXISTS
:Here’s a db<>fiddle.
Note that I have changed the indexing of
tv_show_genres
in the db<>fiddle. With few exceptions, pivot tables need composite indices in both directions –PK(show_id, genre_id)
and secondary(genre_id, show_id)
.