skip to Main Content

I have a table that records the login information of users on a website, including ID and login date.

For example:

id user_id date
1 1234 2024-01-01
2 2341 2024-01-07
3 2341 2024-01-13
4 3412 2024-01-20
5 4123 2024-01-20
6 4123 2024-01-21
7 4123 2024-01-22
8 2341 2024-01-07
9 2341 2024-01-22
10 3412 2024-01-20
11 4123 2024-01-20
12 4123 2024-01-21
13 4123 2024-01-22
14 2341 2024-01-23
15 2341 2024-01-25
16 3412 2024-01-20

Note that the table will contain duplicate data because users may log in multiple times a day.

I now want to calculate the average next day retention rate of users, for the example data above, the average next-day retention rate would be 0.3.

I tried the following query, but the result I got was 0.4. I think my idea should be right, but I don’t know why the result is wrong.

Can anyone help me figure out what the problem is or give me a correct answer? (I use PieCloudDB Database. If you haven’t heard of it, you can use PostgreSQL instead. PieCloudDB is compatible with PostgreSQL)

SELECT COUNT(DISTINCT (q2.user_id,q2.date))*1.0/COUNT(DISTINCT (q1.user_id,q1.date)) as avg_ret
from login as q1 
left join login as q2
on q1.user_id=q2.user_id 
and q2.date = q1.date + interval '1 day'

3

Answers


  1. I tried deduplicating the id and date separately and the result I got was 0.3. (Tested on the pieclouddb database cloud platform)

    SELECT COUNT(q2.user_id)*1.0 / COUNT(q1.user_id) AS avg_ret
    FROM (SELECT DISTINCT user_id, date 
      FROM login) AS q1
    LEFT JOIN (SELECT DISTINCT user_id, date 
      FROM login) AS q2
    ON q1.user_id = q2.user_id 
    AND q2.date = q1.date + interval '1 day'
    
    Login or Signup to reply.
  2. The error is caused by COUNT(DISTINCT q2.user_id, q2.date) including NULL, NULL as one of the distinct combinations. The following query produces the correct result of 0.3:

    SELECT
      COUNT(DISTINCT (q2.user_id, q2.date))
        FILTER (WHERE q2.date IS NOT NULL) * 1.0
      /
     COUNT(DISTINCT (q1.user_id, q1.date)) AS avg_ret
    FROM
      login AS q1
      LEFT JOIN login AS q2
        ON q1.user_id = q2.user_id
          AND q2.date = q1.date + INTERVAL '1 day';
    
    Login or Signup to reply.
  3. Another option to solve this is to avoid the LEFT JOIN and use CASE and EXISTS logic.

    The "problem" with LEFT JOIN is you need to handle NULL values, as well explained in this answer. This is not required when using EXISTS.

    At the other side, some people prefer using JOIN over subqueries and sometimes, the performance is better. I think both ways are fine, just try out.

    WITH login_sub AS
      (SELECT
        DISTINCT user_id, login_date 
       FROM login)
    SELECT 
      COUNT(CASE WHEN EXISTS 
        (SELECT 1 FROM login_sub l2 
         WHERE 
           login_sub.user_id = l2.user_id
           AND login_sub.login_date + INTERVAL '1 day' = l2.login_date)
      THEN 1 END) * 1.0 / COUNT(*) AS avg_ret
    FROM login_sub;
    

    This query returns 0.3, see this sample fiddle with your data

    Note: I renamed the column date to login_date in my answer because date is a key word and it’s recommended to avoid key words as table name or as column name.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search