skip to Main Content

I am trying to retrieve data from public.users tables and this the query I am using but I am not able to retrieve, what i actually require:

WITH duplicates AS (
    SELECT email, COUNT(*) AS duplicate_count
    FROM public.users
    GROUP BY email
    HAVING COUNT(*) > 1)
SELECT u.user_id, u.user_name, d.duplicate_count
FROM public.users u
JOIN duplicates d ON u.email = d.email
WHERE u.registration_date >= CURRENT_DATE - INTERVAL '30 days';

I am trying to use the concept of Common Table Expression (CTE) named "duplicates" to find all email addresses that have more than one occurrence in the "public.users" table.
I might be making some mistake, because despite having duplicate emails I am not able to fetch them using this particular query.
I will be grateful if you could help!

3

Answers


  1. The key issues in your query:

    The CTE itself needs the filter on registration_date, not the outer query, to only consider recent emails for counting duplicates.

    When joining the CTE back to the users table, you again need the registration_date filter in the join condition to only get recent duplicates.

        WITH duplicates AS (
        SELECT email, COUNT(*) AS duplicate_count
        FROM public.users
        WHERE registration_date >= CURRENT_DATE - INTERVAL '30 days'
        GROUP BY email
        HAVING COUNT(*) > 1)
        SELECT u.user_id, u.user_name, d.duplicate_count
        FROM public.users u
        JOIN duplicates d ON u.email = d.email
        WHERE u.registration_date >= CURRENT_DATE - INTERVAL '30 days';
    

    Here is an example for you to understand better:

    Let’s say there is an email that appears 3 times total in the table. But only 2 of them are within the last 30 days.

    If you don’t filter in the CTE, it would count 3 occurrences and include that email in the duplicate list.

    But in the outer query, only 2 of those rows would match the date filter. So you’d end up with incorrect data.

    By filtering in both parts, you ensure the logic is restricted to the right date range in all steps.

    Login or Signup to reply.
  2. the issue is with the GROUP BY clause in the CTE.

    you are grouping by the email column, but you are not including the user_id or user_name in the SELECT statement of the CTE.

    this could cause unexpected results when you later join the CTE with the public.users table.

    Login or Signup to reply.
  3. If you just want to get emails with entries during the last 30 days and a duplicate from any time:

    SELECT email, count(*) AS duplicate_count
    FROM   public.users
    GROUP  BY email
    HAVING count(*) > 1
    AND    max(u.registration_date) >= CURRENT_DATE - 30
    

    Also assuming registration_date is an actual date.

    If the table is big, with lots of older data, and duplicates often occur in close proximity, this more elaborate query might perform better:

    WITH email_cnt AS (
       SELECT email, count(*) AS duplicate_count
       FROM   public.users
       WHERE  u.registration_date >= CURRENT_DATE - 30
       GROUP  BY email
       )
    SELECT u.user_id, u.user_name, e.duplicate_count  -- local dupes
    FROM   email_cnt e
    JOIN   public.users u USING (email)
    WHERE  e.duplicate_count > 1
    
    UNION  -- not UNION ALL, because both may apply
    SELECT u.user_id, u.user_name, e.duplicate_count  -- remote dupes
    FROM   email_cnt e
    JOIN   public.users u USING (email)
    WHERE  e.duplicate_count = 1
    AND    EXISTS (
       SELECT FROM public.users u1
       WHERE  u1.email = e.email
       AND    u1.registration_date < CURRENT_DATE - 30
       );
    

    This time also including details for the latest entry.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search