Postgresql - Calculate the average next-day retention rate of users

MeliodasDragon
March 28, 2024
161 views
3 votes
3 Answers

I have a table that records the login information of users on a website, including ID and login date.

For example:

id	user_id	date
1	1234	2024-01-01
2	2341	2024-01-07
3	2341	2024-01-13
4	3412	2024-01-20
5	4123	2024-01-20
6	4123	2024-01-21
7	4123	2024-01-22
8	2341	2024-01-07
9	2341	2024-01-22
10	3412	2024-01-20
11	4123	2024-01-20
12	4123	2024-01-21
13	4123	2024-01-22
14	2341	2024-01-23
15	2341	2024-01-25
16	3412	2024-01-20

Note that the table will contain duplicate data because users may log in multiple times a day.

I now want to calculate the average next day retention rate of users, for the example data above, the average next-day retention rate would be 0.3.

I tried the following query, but the result I got was 0.4. I think my idea should be right, but I don’t know why the result is wrong.

Can anyone help me figure out what the problem is or give me a correct answer? (I use PieCloudDB Database. If you haven’t heard of it, you can use PostgreSQL instead. PieCloudDB is compatible with PostgreSQL)

SELECT COUNT(DISTINCT (q2.user_id,q2.date))*1.0/COUNT(DISTINCT (q1.user_id,q1.date)) as avg_ret
from login as q1 
left join login as q2
on q1.user_id=q2.user_id 
and q2.date = q1.date + interval '1 day'

Tags: postgresql sql

Answers

- Ban
- March 28, 2024 at 8:04 am
- 0 votes
0
I tried deduplicating the id and date separately and the result I got was 0.3. (Tested on the pieclouddb database cloud platform)
```
SELECT COUNT(q2.user_id)*1.0 / COUNT(q1.user_id) AS avg_ret
FROM (SELECT DISTINCT user_id, date 
  FROM login) AS q1
LEFT JOIN (SELECT DISTINCT user_id, date 
  FROM login) AS q2
ON q1.user_id = q2.user_id 
AND q2.date = q1.date + interval '1 day'
```
Login or Signup to reply.

- JohnH
- March 28, 2024 at 8:15 am
- 0 votes
0
The error is caused by COUNT(DISTINCT q2.user_id, q2.date) including NULL, NULL as one of the distinct combinations. The following query produces the correct result of 0.3:
```
SELECT
  COUNT(DISTINCT (q2.user_id, q2.date))
    FILTER (WHERE q2.date IS NOT NULL) * 1.0
  /
 COUNT(DISTINCT (q1.user_id, q1.date)) AS avg_ret
FROM
  login AS q1
  LEFT JOIN login AS q2
    ON q1.user_id = q2.user_id
      AND q2.date = q1.date + INTERVAL '1 day';
```
Login or Signup to reply.

- JonasMetzler
- March 28, 2024 at 8:16 am
- 0 votes
0
Another option to solve this is to avoid the LEFT JOIN and use CASE and EXISTS logic.

The "problem" with LEFT JOIN is you need to handle NULL values, as well explained in this answer. This is not required when using EXISTS.

At the other side, some people prefer using JOIN over subqueries and sometimes, the performance is better. I think both ways are fine, just try out.
```
WITH login_sub AS
  (SELECT
    DISTINCT user_id, login_date 
   FROM login)
SELECT 
  COUNT(CASE WHEN EXISTS 
    (SELECT 1 FROM login_sub l2 
     WHERE 
       login_sub.user_id = l2.user_id
       AND login_sub.login_date + INTERVAL '1 day' = l2.login_date)
  THEN 1 END) * 1.0 / COUNT(*) AS avg_ret
FROM login_sub;
```
This query returns 0.3, see this sample fiddle with your data

Note: I renamed the column date to login_date in my answer because date is a key word and it’s recommended to avoid key words as table name or as column name.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Calculate the average next-day retention rate of users

Answers