First of all, sorry for bad title – I can’t figure out how to write generalized formulation of my problem.
I have a table in PostgreSQL with users and dates of their internships. It looks like this:
user_id | start | end |
---|---|---|
1 | December 22, 2019 | June 29, 2020 |
2 | March 8, 2020 | September 8, 2020 |
3 | May 21, 2020 | November 21, 2020 |
From this I need to calculate for each month, how many people were on internship during this month. I only need to calculate full months (if internship actually started on December 22, 2019, I will calculate from January 2022. If internship were finished at June 29, 2020, I will calculate till May 2020.
Finally I need this table:
Month | Count |
---|---|
Jan-20 | 1 |
Feb-20 | 1 |
Mar-20 | 1 |
Apr-20 | 2 |
May-20 | 2 |
Jun-20 | 2 |
Jul-20 | 2 |
Aug-20 | 2 |
Sep-20 | 1 |
Oct-20 | 1 |
For making it absolutely clear, this is how I got it:
Month | user_1 | user_2 | user_3 | Count |
---|---|---|---|---|
Jan-20 | 1 | 1 | ||
Feb-20 | 1 | 1 | ||
Mar-20 | 1 | 1 | ||
Apr-20 | 1 | 1 | 2 | |
May-20 | 1 | 1 | 2 | |
Jun-20 | 1 | 1 | 2 | |
Jul-20 | 1 | 1 | 2 | |
Aug-20 | 1 | 1 | 2 | |
Sep-20 | 1 | 1 | ||
Oct-20 | 1 | 1 |
My idea is to:
- Reshape my initial table, so it will look like this:
user_id | date | event |
---|---|---|
1 | December 22, 2019 | start |
1 | June 29, 2020 | end |
2 | March 8, 2020 | start |
2 | September 8, 2020 | end |
3 | May 21, 2020 | start |
3 | November 21, 2020 | end |
- Generate series between each start and end event:
user_id | month |
---|---|
1 | Jan-20 |
1 | Feb-20 |
1 | Mar-20 |
1 | Apr-20 |
1 | May-20 |
2 | Apr-20 |
2 | May-20 |
2 | Jun-20 |
2 | Jul-20 |
2 | Aug-20 |
3 | Jun-20 |
3 | Jul-20 |
3 | Aug-20 |
3 | Sep-20 |
3 | Oct-20 |
- Using count() GROUP BY month
Unfortunately, I have problems with 1 and 2 clauses.
I don’t know how to reshape the table in PostgreSQL. In Pandas I would use ‘stack’ function. For my case I can’t find the appropriate function.
Even if I can reshape it, I don’t understand how to make series of month for each user (shown above).
Please advise, what can be done here to solve my problem?
4
Answers
One approach could be to
GENERATE_SERIES
Check the demo here.
If you’re using a PostgreSQL legacy version, you can obtain the calendar table with a recursive query:
Check the demo here.
this query should do the job considering your table as
test
:test
.JOIN
clause calculates the intersections between the months and the date interval for each userWHERE
clause filters the rows where the date interval for a user corresponds to a full months.Result :
see results in dbfiddle
Here is how I accomplished this:
generate_series
start
andend
dates to ensure we are only including instances where a full month of the internship was completed.cross join
to generate acartesian product set
.WHERE
predicate to include instances where thedate_mm
is between the truncatedstart
andend
dates.SQL:
Result:
SQL Fiddle:
http://sqlfiddle.com/#!17/9f9f3/37
Example with date calendar as subquery – number sequence (10*12) – 10 year.
Test data:
Main query:
Somewhat complicated to check start_date is first day of month and end_date is last day of month
Fiddle here