I am new to SQL and want to execute a query with joins.
What I noticed is, when I have 2 rows in my table workers_send_times then I get doubled the same rows like
start_time: 2024-01-01
end_time: 2024-01-02
start_time: 2024-01-01
end_time: 2024-01-02
start_time: 2024-01-01
end_time: 2024-01-02
start_time: 2024-01-01
end_time: 2024-01-02
So I have 2 rows but I get 4. if I have 4 rows in workers_send_times table then I get 16 rows so it shows me the duplicated rows.
What I am doing wrong ?
SELECT
u.id,
u.firstname,
u.lastname,
wp.location_orders_id,
l.name,
wst.start_time,
wst.end_time
FROM users u
INNER JOIN workers_plan wp
ON wp.user_id = u.id
INNER JOIN workers_send_times wst
ON wst.user_id = u.id
LEFT JOIN location_orders lo
ON lo.id = wp.location_orders_id
LEFT JOIN location l
ON l.id = lo.location_id
WHERE u.id = '32dc3da6-1e56-4a14-b8b5-8f532edb1b04'
So I google it how can I prevent then I came across group by.
I added GROUP_BY wst.id because I want to group by my workers_send_times id that I dont get duplicated rows.
But Postgresql says to me I have to appear all my select data in group by like this:
GROUP BY wst.id, u.id, wp.location_orders_id, l.name, wst.start_time, wst.end_time
So now it works and I get no duplicated rows but did I miss something or can I do it better. What would you change in my query ?
2
Answers
This became too long for a comment, so I’ll add it as an answer. GROUP BY is a powerful operator in SQL, but I get the feeling that it is often misunderstood. In your example, you could have added one or more aggregate functions such as COUNT, MIN, MAX, etc. You would then apply those to each group:
If you have no intention of applying such a function in your groups, a shorthand is to use DISTINCT. DISTINCT is not a function, it applies to the whole row, your example would then be:
I used the shorter form of INNER JOIN (just JOIN) in my example. That said, your way is fine and more general (say you want to apply a function later on), but distinct is a bit shorter and perhaps more obvious.
I think, I should add an answer here, despite there being an accepted answer already.
No. As long as these are tables and not views, this cannot happen. If two rows in a table lead to four rows in the result, but four rows lead to sixteen rows, then you must be joining the table to itself. That could only be the case in your query, if one of the supposed tables were a view including the workers_send_times table.
Well, if workers_send_times is a view containing workers_plan, then the solution is either not to use the view, but the underlying tables or get the workers_plan data from the view and not join the table again.
If, on the other hand these are tables, and you get duplicate results, then the numbers do not get squared, but only multiplied. The easiest method to get rid of duplicates is to use
SELECT DISTINCT
. This is especially what beginners do in such situations. The professional developer on the other hand asks: why are there duplicates?SELECT DISTINCT
is very often an indicator for a bad database design or an inappropriate query.Let’s look at the joins: You start with a user, get its users row and then join all workers_plan rows for that user. Then you join all workers_send_times rows for the user. This looks wrong. SQL is about relations, and while both tables are related to users, the two tables are not really related to each other, it seems. A user may have three worker plans A, B, and C. The same user may have two send times rows, say one of 8am, one of 10am. Do these send times relate to one of the plans? Probably not, because they are related to a user by user_id. Otherwise there would be a workers_plan_id instead in the table. So there are three plans and two send times, and you are joining these. This is called a cartesian join and leads to six rows (3 x 2): A|8am, A|10am, B|8am, B|10am, C|8am, C|10am. Why are you joining these two tables in order to show the location and start and end time? The times are not related to the work plans’ locations.
The question to ask now is: Is the data model flawed? This would be the case when times should actually be related to a plan. And the solution would be to replace the user_id by a workers_plan_id in workers_send_times. If the data model is not flawed, then the query is. Maybe you just want two queries, one showing the user’s plans and locations, and the other showing the user’s times.