
There is a table in a PostgreSQL database that stores words from a book and the pages on which those words occur. Something like this:

 headword | dict_pages       |   timestamp_updated    
----------+------------------+------------------------
 abcdefg  | {229}            | 2023-07-28 14:49:13+00
 ccdsd    | {213}            | 2023-07-28 18:48:11+00
 zdx      | {228}            | 2023-07-27 18:37:42+00
 xcdferc  | {227, 228}       | 2023-07-27 14:47:55+00

I would like to know how many pages were processed on each particular day. Here is how I am doing this now:

SELECT
    to_char(timestamp_updated, 'YYYY-MM-DD') as dt,
    count(DISTINCT page_num)
FROM dictionary_word dtw, unnest(dict_pages) page_num
WHERE dict_pages IS NOT NULL
GROUP BY dt
ORDER BY dt ASC;

My goal is to get how many new pages were processed. Old pages that were processed some time ago are not relevant for this statistic.

From the example above, my existing query gives the following:

2023-07-27 - 2
2023-07-28 - 2

But the word ccdsd was added a few days/weeks/months ago, so page 213 should be excluded from the count of processed pages. I think a solution might be to take the maximum page number from the previous day (2023-07-27, which is 228) and then, for the current day, count only page numbers greater than 228.
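One way to express that running-max idea directly (a sketch only, using the table and column names from the sample above, and assuming page numbers only ever increase over time):

    WITH pages AS (
      -- one row per (day, page number)
      SELECT timestamp_updated::date AS dt, page_num::int AS page_num
      FROM dictionary_word, unnest(dict_pages) AS page_num
    ),
    daily_max AS (
      SELECT dt, max(page_num) AS max_page
      FROM pages
      GROUP BY dt
    ),
    with_prev AS (
      -- running max over all *previous* days (0 for the first day)
      SELECT dt,
             coalesce(max(max_page) OVER (ORDER BY dt
                 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS prev_max
      FROM daily_max
    )
    SELECT p.dt, count(DISTINCT p.page_num) AS new_pages
    FROM pages p
    JOIN with_prev w ON w.dt = p.dt
    WHERE p.page_num > w.prev_max
    GROUP BY p.dt
    ORDER BY p.dt;

On the sample data this would count 2 new pages for 2023-07-27 and 1 for 2023-07-28, since page 213 falls below the previous day's maximum of 228.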

Is there any elegant way to achieve this with SQL?


Answers


  1. DB fiddle with step-by-step queries

    I think it’s not the most elegant way, but a query could be:

    -- get correct statistics about processed pages
    with prev_max_page_data as (
      with max_page_data as (
        select dtw.headword,
             to_char(timestamp_updated, 'YYYY-MM-DD') as time_upd,
             -- select max_page_num from 'dict_pages' unnested array value
             max(page_num::integer) over(partition by dtw.headword) as max_page_num
        from dictionary_word dtw, unnest(dict_pages) page_num
      )
      select mpd.*,
        -- get prev_max_page for each date using lag() window function 
        coalesce(lag(max_page_num) over (order by time_upd), 0) as prev_max_page
      from max_page_data mpd
    )
    select time_upd, count(DISTINCT headword) as pages_count
    from prev_max_page_data
    where max_page_num >= prev_max_page
    group by time_upd;
    

    Details:

    1. The max_page_data query gets the max page number from each unnested dict_pages array.
    2. The prev_max_page_data query identifies the previous max page for each date.
    3. The outer query filters rows with where max_page_num >= prev_max_page and calculates the final statistics about processed pages.
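    The nested WITH blocks can also be chained at the top level, which some find easier to read (same logic as above, just restructured):

        with max_page_data as (
          select dtw.headword,
                 to_char(timestamp_updated, 'YYYY-MM-DD') as time_upd,
                 max(page_num::integer) over (partition by dtw.headword) as max_page_num
          from dictionary_word dtw, unnest(dict_pages) page_num
        ),
        prev_max_page_data as (
          select mpd.*,
                 coalesce(lag(max_page_num) over (order by time_upd), 0) as prev_max_page
          from max_page_data mpd
        )
        select time_upd, count(DISTINCT headword) as pages_count
        from prev_max_page_data
        where max_page_num >= prev_max_page
        group by time_upd;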
  2. The query below first unnests the page-number arrays, then during aggregation filters out any headword whose dict_page was already covered on an earlier date (i.e. there exists a dict_page greater than or equal to the dict_page in question with an earlier date):

    with dt as (
       select date(d.timestamp_updated) date, d.headword, c::int 
       from dictionary_word d cross join unnest(d.dict_pages) c
    )
    select d.date, count(distinct d.headword) c from dt d 
    where not exists (select 1 from dt d1 where d.c <= d1.c and d1.date < d.date)
    group by d.date
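
    For reference, this can be reproduced end to end. The table definition below is assumed from the sample data in the question (the actual DDL isn't shown there):

        -- assumed schema, inferred from the sample rows
        CREATE TABLE dictionary_word (
            headword          text PRIMARY KEY,
            dict_pages        int[],
            timestamp_updated timestamptz
        );

        INSERT INTO dictionary_word VALUES
            ('abcdefg', '{229}',     '2023-07-28 14:49:13+00'),
            ('ccdsd',   '{213}',     '2023-07-28 18:48:11+00'),
            ('zdx',     '{228}',     '2023-07-27 18:37:42+00'),
            ('xcdferc', '{227,228}', '2023-07-27 14:47:55+00');

    Against this data the query returns 2 for 2023-07-27 and 1 for 2023-07-28: page 213 is excluded on the second day because page 228 was already processed the day before, which is what the question asks for.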
    

    See fiddle
