I’m learning SQL from a website and they have this question for which my attempt below isn’t sufficient and needs input.
Summary of question:
"Find the difference in days between the last post and the first post of the year 2021."
Data:
Table posts
:
Column Name Type
user_id integer
post_id integer
post_date timestamp
post_content text
Sample data:
user_id post_id post_date post_content
151652 599415 07/10/2021 12:00:00 Need a hug
661093 624356 07/29/2021 13:00:00 Bed. Class 8-12.
004239 784254 07/04/2021 11:00:00 Happy 4th of July!
661093 442560 07/08/2021 14:00:00 Just going to cry myself to sleep
151652 111766 07/12/2021 19:00:00 I'm so done with covid - need travelling ASAP!
Example output:
user_id days_between
151652 2
661093 21
My attempt at answering has required a lot of reading and the best I’ve come up with is this for PostgreSQL:
WITH
target_year AS
(
SELECT *
FROM
(
SELECT
user_id,
post_date,
dense_rank() over(PARTITION BY user_id order by user_id desc) as n
FROM posts
where EXTRACT('year' from post_date)=2021
order by user_id DESC
) as x
where n <=2
),
target_user AS
(
SELECT user_id,
post_date::date - lag(post_date::date) over (order by user_id) from target_year
)
select * from target_user
order by user_id
The results I get are:
user_id column
151325 NULL
151652 -207
151652 244
151652 -307
178425 55
423967 42
661093 105
661093 21
661093 -206
I am confused by the above
-
I selected
where n <=2
in thewith
clausetarget_year
. I’m expecting a maximum of only two results for each group, but the above shows at least three are coming though. How is my logic broken? -
My understanding is that by using the type
::date
it will force the date into days. How do I make the dates absolute, and also why are they negative? My understanding of my logic is that because I order by date descending the top two results should only be subtracting from a larger number (post_date::date
) from a smaller number (lag(post_date::date)) so the presence of negative numbers should be impossible, how is my logic wrong? -
Also when I change the logic from
where n <=2
towhere n >=2
in thewith
clausetarget_year
. I do not get any results back. I was expecting at least those with two and three entries to be there. Why is this logic not working?
2
Answers
Can be done with plain aggregates in a single query level:
Use "sargable" predicates for the date range. See:
As @ErwinBrandstetter noted above, the usual way is a simple grouping.
If you implement your approach to solving the task, you can do it like this
Filter
where n_asc=1 or n_desc=1
leaves the first and last row in the selection for each Id. If first and last row is the same lag(…) gets null, and diff is null.As for the use of the rank() window function, it can give more than a single line under the number n=1. Then the decision will be wrong. Therefore, you need to use the row_number() function.