There is a table in a PostgreSQL database that stores words from a book and the pages where those words occur. Something like this:
    headword | dict_pages | timestamp_updated
    ---------+------------+------------------------
    abcdefg  | {229}      | 2023-07-28 14:49:13+00
    ccdsd    | {213}      | 2023-07-28 18:48:11+00
    zdx      | {228}      | 2023-07-27 18:37:42+00
    xcdferc  | {227, 228} | 2023-07-27 14:47:55+00
I would like to know how many pages were processed on any particular day. Here is how I am doing this now:
    SELECT
        to_char(timestamp_updated, 'YYYY-MM-DD') AS dt,
        count(DISTINCT page_num)
    FROM dictionary_word dtw, unnest(dict_pages) page_num
    WHERE dict_pages IS NOT NULL
    GROUP BY dt
    ORDER BY dt ASC;
My goal is to find out how many new pages were processed each day. Old pages that were processed some time ago are not relevant for this statistic.
For the example above, my existing query returns the following:
2023-07-27 - 2
2023-07-28 - 2
But the word ccdsd was already added a few days/weeks/months ago, so page 213 should be excluded from the count of processed pages. I think a solution might be to take the maximum page number from the previous day, 2023-07-27, which is 228, and then for the current day count only the page numbers that are greater than 228.
Is there any elegant way to achieve this with SQL?
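The rule described above (count only pages above the highest page seen on earlier days) could be sketched roughly as follows. This is a minimal sketch, not a tested production query: it assumes dict_pages is integer[], timestamp_updated is timestamptz, days are cut in UTC, and page numbers are positive. The CTE names day_pages, day_max and prev_max are made up for illustration, and the dictionary_word CTE only stands in for the real table so the query can be run on its own.

```sql
-- Sample rows from the question stand in for the real dictionary_word table.
-- Assumption: dict_pages is integer[] and timestamp_updated is timestamptz.
WITH dictionary_word (headword, dict_pages, timestamp_updated) AS (
    VALUES
        ('abcdefg', ARRAY[229],      TIMESTAMPTZ '2023-07-28 14:49:13+00'),
        ('ccdsd',   ARRAY[213],      TIMESTAMPTZ '2023-07-28 18:48:11+00'),
        ('zdx',     ARRAY[228],      TIMESTAMPTZ '2023-07-27 18:37:42+00'),
        ('xcdferc', ARRAY[227, 228], TIMESTAMPTZ '2023-07-27 14:47:55+00')
),
-- One row per distinct (day, page) pair; the day is taken in UTC (assumption).
day_pages AS (
    SELECT DISTINCT
        (w.timestamp_updated AT TIME ZONE 'UTC')::date AS dt,
        p.page_num
    FROM dictionary_word AS w
    CROSS JOIN LATERAL unnest(w.dict_pages) AS p(page_num)
),
-- Highest page seen on each day.
day_max AS (
    SELECT dt, max(page_num) AS max_page
    FROM day_pages
    GROUP BY dt
),
-- Highest page seen on any earlier day; 0 when there is no earlier day
-- (assumes page numbers are positive).
prev_max AS (
    SELECT
        dt,
        COALESCE(
            max(max_page) OVER (
                ORDER BY dt
                ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
            ),
            0
        ) AS prev_max_page
    FROM day_max
)
-- Count only pages strictly above the earlier maximum; the LEFT JOIN keeps
-- days on which no new page was processed (they show up with a count of 0).
SELECT
    m.dt,
    count(p.page_num) AS new_pages
FROM prev_max AS m
LEFT JOIN day_pages AS p
       ON p.dt = m.dt
      AND p.page_num > m.prev_max_page
GROUP BY m.dt
ORDER BY m.dt;
```

The window frame uses all earlier days rather than only the immediately preceding one, so a day on which only old pages were touched does not lower the threshold for the following days.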
2 Answers
DB fiddle with step-by-step queries
I think it’s not the most elegant way, but a query could be:
Details:
The max_page_data query helps to get the max page for each dict_pages jsonb[] array.
The prev_max_page_data query helps to identify the previous max page for each date.
The final query filters where max_page_num >= prev_max_page and calculates the final statistics about processed pages.

The query below first unnests the page number arrays and then, during the aggregation, filters out any headword with a corresponding dict_page that has been processed at an earlier date (i.e. there exists a dict_page greater than or equal to the dict_page in question with a smaller timestamp):
See fiddle
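A filter of that kind might look roughly like the sketch below. It is not the query from the fiddle. It assumes dict_pages is integer[] and timestamp_updated is timestamptz, cuts days in UTC, and compares calendar dates rather than raw timestamps, so that pages processed in descending order within one day still count as new. The pages CTE and the cur/old aliases are hypothetical names, and the dictionary_word CTE only stands in for the real table. The filter is applied in WHERE before grouping, which is a simplification of filtering during the aggregation.

```sql
-- Sample rows from the question stand in for the real dictionary_word table.
-- Assumption: dict_pages is integer[] and timestamp_updated is timestamptz.
WITH dictionary_word (headword, dict_pages, timestamp_updated) AS (
    VALUES
        ('abcdefg', ARRAY[229],      TIMESTAMPTZ '2023-07-28 14:49:13+00'),
        ('ccdsd',   ARRAY[213],      TIMESTAMPTZ '2023-07-28 18:48:11+00'),
        ('zdx',     ARRAY[228],      TIMESTAMPTZ '2023-07-27 18:37:42+00'),
        ('xcdferc', ARRAY[227, 228], TIMESTAMPTZ '2023-07-27 14:47:55+00')
),
-- One row per (day, page) occurrence; the day is taken in UTC (assumption).
pages AS (
    SELECT
        (w.timestamp_updated AT TIME ZONE 'UTC')::date AS dt,
        p.page_num
    FROM dictionary_word AS w
    CROSS JOIN LATERAL unnest(w.dict_pages) AS p(page_num)
)
SELECT
    cur.dt,
    count(DISTINCT cur.page_num) AS new_pages
FROM pages AS cur
-- Drop a page if an equal or higher page was already seen on an earlier day.
WHERE NOT EXISTS (
    SELECT 1
    FROM pages AS old
    WHERE old.dt < cur.dt
      AND old.page_num >= cur.page_num
)
GROUP BY cur.dt
ORDER BY cur.dt;
```

On the four sample rows this keeps pages 227 and 228 for 2023-07-27 and only page 229 for 2023-07-28. Because the filter runs before grouping, a day on which every page is old does not appear in the result at all.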