Postgresql - SQL to calculate number of values for each day based on a max value from previous date

5w0rdf1sh
July 30, 2023
276 views
0 votes
2 Answers

There is a table in PostgreSQL database which stores data about words from a book and pages where those words occur. Something like this:

 headword | dict_pages       |   timestamp_updated    
----------+------------------+------------------------
 abcdefg  | {229}            | 2023-07-28 14:49:13+00
 ccdsd    | {213}            | 2023-07-28 18:48:11+00
 zdx      | {228}            | 2023-07-27 18:37:42+00
 xcdferc  | {227, 228}       | 2023-07-27 14:47:55+00

I would like to know, how many pages were processed for any particular day. Here is how I am doing this now:

SELECT
    to_char(timestamp_updated, 'YYYY-MM-DD') as dt,
    count(distinct(page_num))
FROM dictionary_word dtw, unnest(dict_pages) page_num
WHERE dict_pages IS NOT NULL
GROUP BY dt
ORDER BY dt ASC

My goals is to get how many new pages were processed. Old pages that were processed some time ago are not relevant for this statistics.

From example above, with my existing query I will get the following:

2023-07-27 - 2
2023-07-28 - 2

But word ccdsd were added few days/weeks/months ago already, so page 213 should be excluded from calculation of number of processed pages. I think that solution for this might be getting maximum page number from the previous day, 2023-07-27 which is 228, then for the current day calculate only numbers (pages) which are greater than 228.

Is there any elegant way to achieve this with SQL?

Tags: postgresql sql

Answers

DB fiddle with step-by-step queries

I think it’s not the most elegant way, but a query could be:

-- get correct statistics about processed pages
with prev_max_page_data as (
  with max_page_data as (
    select dtw.headword,
         to_char(timestamp_updated, 'YYYY-MM-DD') as time_upd,
         -- select max_page_num from 'dict_pages' unnested array value
         max(page_num::integer) over(partition by dtw.headword) as max_page_num
    from dictionary_word dtw, unnest(dict_pages) page_num
  )
  select mpd.*,
    -- get prev_max_page for each date using lag() window function 
    coalesce(lag(max_page_num) over (order by time_upd), 0) as prev_max_page
  from max_page_data mpd
)
select time_upd, count(distinct(headword)) as pages_count
from prev_max_page_data
where max_page_num >= prev_max_page
group by time_upd;

Details:

max_page_data query helps to get max page for each dict_pages jsonb[] array.
prev_max_page_data query helps to identify previous max page for each date.
Outer query filters values using filter where max_page_num >= prev_max_page and calculates the final statistics about processed pages.

- Ajax1234
- July 30, 2023 at 6:04 pm
- 0 votes
0
The query below first unnests the page number arrays, and then during the aggregation, filters out any headword with a corresponding dict_page that has been processed at an earlier date (i.e there exists a dict_page greater than or equal to the dict_page in question with a smaller timestamp):
```
with dt as (
   select date(d.timestamp_updated) date, d.headword, c::int 
   from dictionary_word d cross join unnest(d.dict_pages) c
)
select d.date, count(distinct d.headword) c from dt d 
where not exists (select 1 from dt d1 where d.c <= d1.c and d1.date < d.date)
group by d.date
```
See fiddle
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – SQL to calculate number of values for each day based on a max value from previous date

Answers