I am working with the ml100k dataset to write a query that can fetch me the most rated movie for every age.
Here’s how my tables are defined.
users
id | age | gender | occupation | zipcode
ratings
userid | movieid | rating | ts
What I’ve done so far
- Since the count wasn’t explicitly defined, I wrote a query to find it.
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
This gave me the count for every movie for every age.
10 1 1
11 1 1
13 1 3
14 1 1
15 1 2
16 1 4
17 1 4
18 1 6
19 1 15
20 1 22
21 1 14
- To find the maximum for each age
SELECT age, MAX(mcount) AS mc
FROM (
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
) t1
GROUP BY age
7 1
10 1
11 1
13 5
14 3
15 5
16 5
17 11
18 16
19 21
20 25
21 23
This gave me the age and the max count. However, I also want the corresponding movie ID and this where I have been stuck. My thought process is to join these results with the first table but it is not working. Are there any other alternatives I can try?
This is the query I used.
SELECT users.age, ratings.movieid, count(*) as mc2
FROM ratings JOIN users ON id = userid
INNER JOIN
(
SELECT age, MAX(mcount) AS mc
FROM (
SELECT age, movieid, COUNT(*) AS mcount
FROM ratings
JOIN users ON id = userid
GROUP BY age, movieid
) t1
GROUP BY age
)t2
ON t2.age = users.age
WHERE mc2=t2.mc
GROUP BY users.age, ratings.movieid;
2
Answers
You can do it like this:
Explanation:
ratings
andusers
on
theid
ofusers
age
andmovieid
LEFT JOIN
the two groups byage
movieid
(so we will have sensible count comparisons)mcount
being smaller than the secondWHERE
clause we exclude the scenario when we found higher count matches in the second group than in the first groupFiddle: http://sqlfiddle.com/#!9/f8defe/6