Here’s the relational model for the database that I’m going to talk about (a show can have multiple actors and multiple genres): Relational model
(sorry, don’t have enough reputation to insert an image)
What I would like to achieve is to get every detail about a specific show + the genres that it has + the actors that played in the show and their IDs.
I tried:
SELECT shows.*,
STRING_AGG(g.name, ', ') AS genres,
STRING_AGG(CAST(a.id AS VARCHAR), ', ') AS actor_ids,
STRING_AGG(a.name, ', ') AS actors
FROM shows
LEFT JOIN show_genres sg ON shows.id = sg.show_id
LEFT JOIN genres g ON g.id = sg.genre_id
LEFT JOIN show_characters sc ON shows.id = sc.show_id
LEFT JOIN actors a ON sc.actor_id = a.id
WHERE shows.id = 1390
GROUP BY shows.id
The result:
genres | actor_ids | actors |
---|---|---|
Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Action, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Adventure, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Drama, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy, Fantasy | 436511, 436512, 412933, 9739, 414839, 436539, 436541, 11439, 419217, 415442, 413734, 432873, 420752, 421890, 436592, 448107, 458178, 150561, 436590, 412041, 436511, 436512, 412933, 9739, 414839, 436539, 436541, 11439, 419217, 415442, 413734, 432873, 420752, 421890, 436592, 448107, 458178, 150561, 436590, 412041, 436511, 436512, 412933, 9739, 414839, 436539, 436541, 11439, 419217, 415442, 413734, 432873, 420752, 421890, 436592, 448107, 458178, 150561, 436590, 412041, 436511, 436512, 412933, 9739, 414839, 436539, 436541, 11439, 419217, 415442, 413734, 432873, 420752, 421890, 436592, 448107, 458178, 150561, 436590, 412041 | Emilia Clarke, Kit Harington, Peter Dinklage, Lena Headey, Nikolaj Coster-Waldau, Sophie Turner, Maisie Williams, Iain Glen, Alfie Allen, Liam Cunningham, John Bradley, Conleth Hill, Aidan Gillen, Gwendoline Christie, Isaac Hempstead-Wright, Kristofer Hivju, Nathalie Emmanuel, Jacob Anderson, Jerome Flynn, Rory McCann, Emilia Clarke, Kit Harington, Peter Dinklage, Lena Headey, Nikolaj Coster-Waldau, Sophie Turner, Maisie Williams, Iain Glen, Alfie Allen, Liam Cunningham, John Bradley, Conleth Hill, Aidan Gillen, Gwendoline Christie, Isaac Hempstead-Wright, Kristofer Hivju, Nathalie Emmanuel, Jacob Anderson, Jerome Flynn, Rory McCann, Emilia Clarke, Kit Harington, Peter Dinklage, Lena Headey, Nikolaj Coster-Waldau, Sophie Turner, Maisie Williams, Iain Glen, Alfie Allen, Liam Cunningham, John Bradley, Conleth Hill, Aidan Gillen, Gwendoline Christie, Isaac Hempstead-Wright, Kristofer Hivju, Nathalie Emmanuel, Jacob Anderson, Jerome Flynn, Rory McCann, Emilia Clarke, Kit Harington, Peter Dinklage, Lena Headey, Nikolaj Coster-Waldau, Sophie Turner, Maisie Williams, Iain Glen, Alfie Allen, Liam Cunningham, John Bradley, Conleth Hill, Aidan Gillen, Gwendoline Christie, Isaac Hempstead-Wright, Kristofer Hivju, Nathalie Emmanuel, Jacob Anderson, Jerome Flynn, Rory McCann |
Note: only showing the problematic columns
The problem: After joining the "show_characters" and the "actors" table, there are a lot of duplicate data in the string aggregations.
Question: How would I solve this? Without having duplicate data.
My workaround:
@connection_handler
def get_show_by_id(cursor: 'RealDictCursor', id: int) -> 'RealDictRow':
"""
Args:
cursor: a cursor which returns dictionaries (use @connection.connection_handler decorator)
id: number of items shows on a single page
Returns:
All show details in a RealDictRow + genres(as a concatenated string) + actors in a list
containing the actors ids and names in a RealDictRow
"""
query = """
SELECT shows.*,STRING_AGG(g.name,',') AS genres
FROM shows
LEFT JOIN show_genres sg ON shows.id = sg.show_id
LEFT JOIN genres g ON sg.genre_id = g.id
WHERE shows.id = %s
GROUP BY shows.id
"""
val = (id,)
cursor.execute(query, val)
shows = cursor.fetchone()
query = """
SELECT a.id,a.name
FROM actors a
RIGHT JOIN show_characters sc ON a.id = sc.actor_id
RIGHT JOIN shows s ON sc.show_id = s.id
WHERE s.id = %s
GROUP BY a.id
"""
val = (id,)
cursor.execute(query, val)
actors = cursor.fetchall()
shows["actors"] = actors
return shows
Note for workaround: Used two separate queries to achieve the result I wanted, but I’d like to know how would I be able to do this with one SQL query.
Side note: I’d be really happy to receive any feedback about my code in general, or suggestion about how people usually handle when they get back multiple data from an SQL query, do they just split it? Or is it better to write a separate query, like I did, so I get the data in a separate list right away? What about my docstring? I just started trying to use them, not sure if the way I’m doing it is good or not.
2
Answers
Meanwhile I've been able to come up with a solution myself, just to have an easier time managing the data returned later, I convert the returned array of arrays into a list of dictionaries.
One approach in this sort of situation is to STRING_AGG each of the constituent tables (or sets of tables) first; then, LEFT JOIN those contrived tables onto the main table. This sidesteps the problem of multiplication that can occur during consecutive LEFT JOINs.
In your case, try something like this:
You can solve this in other ways, too, but understanding this technique will be helpful in your SQL journey.