I’m making a database in PostgreSQL that involves around democracy. All data that should be displayed are controlled by the users, and their percentage of power.
I’m struggling to write this SQL query where a collection of tags
on a post
should only be shown once the sum of all the percentage for each tag reaches a certain criteria.
The relations between the tables (relevant to this question) looks like this:
The post_tags
table is used for deciding what tag stays on what post, decided by the users based on their percentage.
It may look something like this
approved_by | post_id | tag_id | percentage |
---|---|---|---|
1 | 1 | 1 | 0.33 |
5 | 1 | 3 | 0.45 |
7 | 1 | 3 | 0.25 |
6 | 1 | 3 | 0.15 |
4 | 1 | 1 | 0.90 |
1 | 1 | 2 | 0.45 |
1 | 1 | 6 | -0.60 |
6 | 1 | 2 | -0.15 |
How do you write an SQL query that selects a post and its tags if the percentage sum is above a certain threshold?
In the case of SUM(post_tags.percentage) > 0.75
, only tag with tag_id
1 and 3 should show.
So far, I have written this SQL query, but it contains duplicates in the array_agg
(might be a separate issue), and the HAVING
only seem to depend on the total sum of all the tags in the array_agg
.
SELECT
posts.post_id, array_agg(tags.name) AS tags
FROM
posts, tags, post_tags
WHERE
post_tags.post_id = posts.post_id AND
post_tags.tag_id = tags.tag_id
GROUP BY
posts.post_id
HAVING
SUM(post_tags.percentage) > 0.75
LIMIT 10;
I assume I might need to do a subquery within the query, but you can’t do SUM
inside the WHERE
clause. I’m a bit lost at this issue.
Any help is appreciated
UPDATE
Because I think there needs to be atleast 2 queries into play, I think this should be one of them
SELECT
tags.name
FROM
post_tags, posts, tags
WHERE
post_tags.tag_id = tags.tag_id AND
post_tags.post_id = posts.post_id AND
posts.post_id = 1
GROUP BY
tags.tag_id
HAVING
SUM(post_tags.percentage) > 0.75
In this case, it’s only for post 1, and I don’t know how to continue this query for all posts
2
Answers
It's easy to get confused, but start small, and then expand the SQL query as you go.
Note the inner parenthesis will execute first. Start with the inner query, and then work on the outer query when building SQL queries.
In this case, for finding the tags relevant for a single post can be written like so
To expand on this, and apply the query for every post, switch out the
1
and set it equal the outer scope. The complete SQL query becomes this:Big thanks to Tim Biegeleisen who helped change out the
FROM
statements toINNER JOIN
(tho performance-wise, both are tested equally fast in this case).One idea would be to first aggregate the total percentage for each post/tag pair in a subquery. The subquery gives you a new join table
unique_post_tags
(one entry per post and tag, including thetotal_percentage
for each post/tag pair). You can then select frompost_tags_unique
in the outer query, filtering irrelevant tags in theWHERE
clause:To actually select the tag names per post and group it into an array as you requested, the above query can be extended like this:
Update
After looking at your answer more closely, I now realize that my idea of reducing the join table to the relevant entries first, can be implemented entirely in the subquery using the
GROUP BY
/HAVING
approach you initially suggested:Or written as CTE (for readability):
If the
0.75
limit is static, you could also create arelevant_post_tags
view in the DB and select from there directly. I did not look at the performance of the above (my guess would be that the query optimizer takes care of it, just note that using CTEs had some pitfalls in earlier Postgres versions).The approach I came up with is a bit different from what you initially asked for, the result set for the queries above will only contain posts that actually have tags.
If you need to select all posts, you can expand like this:
Or closer to your solution: