skip to Main Content

I’m learning SQL from a website and they have this question for which my attempt below isn’t sufficient and needs input.

Summary of question:
"Find the difference in days between the last post and the first post of the year 2021."

Data:

Table posts:

Column Name   Type
user_id       integer
post_id       integer
post_date     timestamp
post_content  text

Sample data:

user_id post_id post_date           post_content
151652  599415  07/10/2021 12:00:00 Need a hug
661093  624356  07/29/2021 13:00:00 Bed. Class 8-12.
004239  784254  07/04/2021 11:00:00 Happy 4th of July!
661093  442560  07/08/2021 14:00:00 Just going to cry myself to sleep 
151652  111766  07/12/2021 19:00:00 I'm so done with covid - need travelling ASAP!

Example output:

user_id days_between
151652  2
661093  21

My attempt at answering has required a lot of reading and the best I’ve come up with is this for PostgreSQL:

WITH
target_year AS
(
  SELECT *
  FROM
  (
    SELECT 
      user_id,
      post_date, 
      dense_rank() over(PARTITION BY user_id order by user_id desc) as n
    FROM posts
    where EXTRACT('year' from post_date)=2021
    order by user_id DESC
  ) as x
  where n <=2
),

target_user AS
(
SELECT  user_id,
  post_date::date - lag(post_date::date) over (order by user_id) from target_year
)

select * from target_user
order by user_id

The results I get are:

user_id column
151325  NULL
151652  -207
151652  244
151652  -307
178425  55
423967  42
661093  105
661093  21
661093  -206

I am confused by the above

  1. I selected where n <=2 in the with clause target_year. I’m expecting a maximum of only two results for each group, but the above shows at least three are coming though. How is my logic broken?

  2. My understanding is that by using the type ::date it will force the date into days. How do I make the dates absolute, and also why are they negative? My understanding of my logic is that because I order by date descending the top two results should only be subtracting from a larger number (post_date::date) from a smaller number (lag(post_date::date)) so the presence of negative numbers should be impossible, how is my logic wrong?

  3. Also when I change the logic from where n <=2 to where n >=2 in the with clause target_year. I do not get any results back. I was expecting at least those with two and three entries to be there. Why is this logic not working?

2

Answers


  1. Can be done with plain aggregates in a single query level:

    SELECT user_id
         , max(post_date)::date - min(post_date)::date AS days_between
    FROM   posts
    WHERE  post_date >= '2021-01-01'
    AND    post_date <  '2022-01-01'
    GROUP  BY user_id
    HAVING count(*) > 1
    ORDER  BY user_id;
    

    Use "sargable" predicates for the date range. See:

    Login or Signup to reply.
  2. As @ErwinBrandstetter noted above, the usual way is a simple grouping.
    If you implement your approach to solving the task, you can do it like this

    WITH target_year AS(
        SELECT user_id,post_date, 
          row_number()over(PARTITION BY user_id order by post_date asc) as n_asc,
          row_number()over(PARTITION BY user_id order by post_date desc) as n_desc
        FROM posts
        where EXTRACT('year' from post_date)=2021
    ),
    target_user AS
    (
    SELECT  user_id,
      post_date::date - lag(post_date::date)over(partition by user_id order by post_date) diff 
      from target_year
      where n_asc=1 or n_desc=1
    )
    select * from target_user
    where diff is not null
    order by user_id
    

    Filter where n_asc=1 or n_desc=1 leaves the first and last row in the selection for each Id. If first and last row is the same lag(…) gets null, and diff is null.

    As for the use of the rank() window function, it can give more than a single line under the number n=1. Then the decision will be wrong. Therefore, you need to use the row_number() function.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search