skip to Main Content

I’m making a database in PostgreSQL that involves around democracy. All data that should be displayed are controlled by the users, and their percentage of power.

I’m struggling to write this SQL query where a collection of tags on a post should only be shown once the sum of all the percentage for each tag reaches a certain criteria.

The relations between the tables (relevant to this question) looks like this:

Image of relations between the tables

The post_tags table is used for deciding what tag stays on what post, decided by the users based on their percentage.

It may look something like this

approved_by post_id tag_id percentage
1 1 1 0.33
5 1 3 0.45
7 1 3 0.25
6 1 3 0.15
4 1 1 0.90
1 1 2 0.45
1 1 6 -0.60
6 1 2 -0.15

How do you write an SQL query that selects a post and its tags if the percentage sum is above a certain threshold?

In the case of SUM(post_tags.percentage) > 0.75, only tag with tag_id 1 and 3 should show.

So far, I have written this SQL query, but it contains duplicates in the array_agg (might be a separate issue), and the HAVING only seem to depend on the total sum of all the tags in the array_agg.

SELECT
    posts.post_id, array_agg(tags.name) AS tags
FROM
    posts, tags, post_tags
WHERE
    post_tags.post_id = posts.post_id AND
    post_tags.tag_id = tags.tag_id
GROUP BY
    posts.post_id
HAVING
    SUM(post_tags.percentage) > 0.75
LIMIT 10;

I assume I might need to do a subquery within the query, but you can’t do SUM inside the WHERE clause. I’m a bit lost at this issue.

Any help is appreciated


UPDATE

Because I think there needs to be atleast 2 queries into play, I think this should be one of them

SELECT
    tags.name
FROM
    post_tags, posts, tags
WHERE
    post_tags.tag_id = tags.tag_id AND
    post_tags.post_id = posts.post_id AND
    posts.post_id = 1
GROUP BY
    tags.tag_id
HAVING
    SUM(post_tags.percentage) > 0.75

In this case, it’s only for post 1, and I don’t know how to continue this query for all posts

2

Answers


  1. Chosen as BEST ANSWER

    It's easy to get confused, but start small, and then expand the SQL query as you go.

    Note the inner parenthesis will execute first. Start with the inner query, and then work on the outer query when building SQL queries.

    In this case, for finding the tags relevant for a single post can be written like so

    SELECT
        t.name
    FROM
        post_tags
    INNER JOIN tags t ON t.tag_id = post_tags.tag_id
    INNER JOIN posts p2 ON p2.post_id = post_tags.post_id AND p2.post_id = 1
    GROUP BY
        t.tag_id
    HAVING
        SUM(post_tags.percentage) > 0.75
    

    To expand on this, and apply the query for every post, switch out the 1 and set it equal the outer scope. The complete SQL query becomes this:

    SELECT p.post_id, ARRAY(
        SELECT
            t.name
        FROM
            post_tags
        INNER JOIN tags t ON t.tag_id = post_tags.tag_id
        INNER JOIN posts p2 ON p2.post_id = post_tags.post_id AND p2.post_id = p.post_id
        GROUP BY
            t.tag_id
        HAVING
            SUM(post_tags.percentage) > 0.75
    ) AS tags
    FROM posts p
    

    Big thanks to Tim Biegeleisen who helped change out the FROM statements to INNER JOIN (tho performance-wise, both are tested equally fast in this case).


  2. One idea would be to first aggregate the total percentage for each post/tag pair in a subquery. The subquery gives you a new join table unique_post_tags (one entry per post and tag, including the total_percentage for each post/tag pair). You can then select from post_tags_unique in the outer query, filtering irrelevant tags in the WHERE clause:

    SELECT unique_post_tags.post_id, unique_post_tags.tag_id FROM 
      (
        SELECT post_id, tag_id, sum(percentage) as total_percentage 
        FROM post_tags
        GROUP BY post_id, tag_id
      ) AS unique_post_tags
    WHERE unique_post_tag.total_percentage > 0.75
    

    To actually select the tag names per post and group it into an array as you requested, the above query can be extended like this:

    SELECT unique_post_tags.post_id, array_agg(t.name) AS tags FROM 
      (
        SELECT post_id, tag_id, sum(percentage) as total_percentage 
        FROM post_tags
        GROUP BY post_id, tag_id
      ) AS unique_post_tags
    LEFT JOIN tags t ON t.id = unique_post_tags.tag_id
    WHERE unique_post_tags.total_percentage > 0.75
    GROUP BY unique_post_tags.post_id
    

    Update

    After looking at your answer more closely, I now realize that my idea of reducing the join table to the relevant entries first, can be implemented entirely in the subquery using the GROUP BY/HAVING approach you initially suggested:

    SELECT relevant_post_tags.post_id , array_agg(t.name) AS tags FROM
      (
        SELECT post_id, tag_id
        FROM post_tags
        GROUP BY post_id, tag_id
        HAVING SUM(percentage) > 0.75
      ) AS relevant_post_tags
    LEFT JOIN tags t ON t.id = relevant_post_tags.tag_id
    GROUP BY relevant_post_tags.post_id;
    

    Or written as CTE (for readability):

    WITH relevant_post_tags AS (
      SELECT post_id, tag_id
      FROM post_tags
      GROUP BY post_id, tag_id
      HAVING SUM(percentage) > 0.75)
    )
    SELECT relevant_post_tags.post_id, array_agg(t.name) AS tags
    FROM relevant_post_tags
    LEFT JOIN tags t ON t.id = relevant_post_tags.tag_id
    GROUP BY relevant_post_tags.post_id;
    

    If the 0.75 limit is static, you could also create a relevant_post_tags view in the DB and select from there directly. I did not look at the performance of the above (my guess would be that the query optimizer takes care of it, just note that using CTEs had some pitfalls in earlier Postgres versions).


    The approach I came up with is a bit different from what you initially asked for, the result set for the queries above will only contain posts that actually have tags.

    If you need to select all posts, you can expand like this:

    WITH relevant_post_tags AS (
      SELECT post_id, tag_id
      FROM post_tags
      GROUP BY post_id, tag_id
      HAVING SUM(percentage) > 0.75
    )
    SELECT p.id, array_remove(array_agg(t.name), NULL) AS tags
    FROM posts p
    LEFT JOIN relevant_post_tags pt on pt.post_id = p.id
    LEFT JOIN tags t ON t.id = pt.tag_id
    GROUP BY p.id;
    

    Or closer to your solution:

    WITH relevant_post_tags AS (
      SELECT post_id, tag_id
      FROM post_tags
      GROUP BY post_id, tag_id
      HAVING SUM(percentage) > 0.75
    )
    SELECT p.id, ARRAY(
      SELECT t.name
      FROM relevant_post_tags pt
      JOIN tags t ON t.id = pt.tag_id 
      WHERE pt.post_id = p.id
    )
    FROM posts p;
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search