I have a Postgres database. It has two tables that we’re concerned with right now. Users
has a Name
field and an ID
field, and Events
has a UserID
field (FK to Users.ID
), EventType
field (typed as integer
), and Time field (typed as datetime
).
I want to find, for each user, the longest streak of consecutive days in which they have generated at least one event of type 3
. The answer should return one row for each user, with the user’s name, the date on which the longest streak began, and the number of days in the streak. Joining the tables to get the name is trivial, but I can’t figure out any good way to determine the streaks. It feels like it should be an aggregate, but there aren’t any applicable aggregates that I can see.
2
Answers
The solution for this is based on the following steps
Step 1: Order the data in asc order for each user and assign row_number
Step 2: calculate the estimated start_date by substracting row_number from the date
Step 3: group by estimated start_date and the user to get the consicutive number of days for which the user has loggedIn
The general approach for "gaps and islands" problems is to identify the beginning of each island and then attach a group identification tag to each row that is part of the island. These tags facilitate additional operations on the data in each island.
The following creates the demonstration tables:
These tables are populated with random events with the following block (the parameters were chosen to give a reasonable probability that at least one user would have more than one longest run with a data set that is small enough to be manually explored):
The following query demonstrates one approach to achieve the desired results (I’ve added a column to report the number of selected events even though it’s not required):
The first CTE (user_event_dates) aggregates the qualifying events by day for each user. The next CTE (group_increments) sets an increment value for each user day that starts a new contiguous series. The third CTE (user_event_groups) tags each user event day with a group number. The main query then selects the earliest longest run for each user. Changing the order on
start_date
to beDESC
will cause the query to return the most recent longest run. Similarly, changing the order toASC
onnum_days
, will return a shortest run for each user.