I have a table that records the login information of users on a website, including ID and login date.
For example:
id | user_id | date |
---|---|---|
1 | 1234 | 2024-01-01 |
2 | 2341 | 2024-01-07 |
3 | 2341 | 2024-01-13 |
4 | 3412 | 2024-01-20 |
5 | 4123 | 2024-01-20 |
6 | 4123 | 2024-01-21 |
7 | 4123 | 2024-01-22 |
8 | 2341 | 2024-01-07 |
9 | 2341 | 2024-01-22 |
10 | 3412 | 2024-01-20 |
11 | 4123 | 2024-01-20 |
12 | 4123 | 2024-01-21 |
13 | 4123 | 2024-01-22 |
14 | 2341 | 2024-01-23 |
15 | 2341 | 2024-01-25 |
16 | 3412 | 2024-01-20 |
Note that the table will contain duplicate data because users may log in multiple times a day.
I now want to calculate the average next day retention rate of users, for the example data above, the average next-day retention rate would be 0.3.
I tried the following query, but the result I got was 0.4. I think my idea should be right, but I don’t know why the result is wrong.
Can anyone help me figure out what the problem is or give me a correct answer? (I use PieCloudDB Database. If you haven’t heard of it, you can use PostgreSQL instead. PieCloudDB is compatible with PostgreSQL)
SELECT COUNT(DISTINCT (q2.user_id,q2.date))*1.0/COUNT(DISTINCT (q1.user_id,q1.date)) as avg_ret
from login as q1
left join login as q2
on q1.user_id=q2.user_id
and q2.date = q1.date + interval '1 day'
3
Answers
I tried deduplicating the id and date separately and the result I got was 0.3. (Tested on the pieclouddb database cloud platform)
The error is caused by
COUNT(DISTINCT q2.user_id, q2.date)
includingNULL, NULL
as one of the distinct combinations. The following query produces the correct result of 0.3:Another option to solve this is to avoid the
LEFT JOIN
and useCASE
andEXISTS
logic.The "problem" with
LEFT JOIN
is you need to handleNULL
values, as well explained in this answer. This is not required when usingEXISTS
.At the other side, some people prefer using
JOIN
over subqueries and sometimes, the performance is better. I think both ways are fine, just try out.This query returns 0.3, see this sample fiddle with your data
Note: I renamed the column
date
tologin_date
in my answer becausedate
is a key word and it’s recommended to avoid key words as table name or as column name.