My table structure is roughly as described in this post:
| name | processed | processing | updated  | ref_time |
|------|-----------|------------|----------|----------|
| abc  | t         | f          | 27794395 | 27794160 |
| def  | t         | f          | 27794395 | 27793440 |
| ghi  | t         | f          | 27794395 | 27793440 |
| jkl  | f         | f          | 27794395 | 27794160 |
| mno  | t         | f          | 27794395 | 27793440 |
| pqr  | f         | t          | 27794395 | 27794160 |
I created a dbfiddle already based on this table structure (more on this below), so there is no need to create your own.
Based on this answer, I am deriving a list of `ref_time` values to use as a basis for deleting 'old' entries from `status_table`:
```sql
with ref as (
  select ref_time
  from status_table
  group by ref_time
  having bool_and(processed)
  order by ref_time desc
  offset 1
)
delete from status_table s
using ref r
where s.ref_time = r.ref_time
```
But now I want to be more sophisticated about what I use as the `offset` … I would ideally like to keep only the most recent `ref_time` for which all records are processed (as per the above example, where `offset` is `1`), but keep the two most recent `ref_time` when the second `ref_time` has more associated records than the first (i.e. `offset` needs to be `2` to skip over the two most recent `ref_time`).
I figure that the following query (based on this answer) will help in this task, because it counts, per `ref_time`, the total number of records and the number of `processed` records:
```sql
select ref_time,
       count(*) cnt_total,
       count(*) filter (where processed) cnt_processed,
       round(avg(processed::int), 2) ratio_processed
from status_table
group by ref_time
order by ratio_processed desc, ref_time desc;
```
So in this dbfiddle I'd need to preserve `ref_time=27794160` (rather than include it in the delete list, as is the case in the example) because, although it is second, it also has a higher `cnt_total` than the first.

In general, the rule is that I want to keep all `ref_time` up to (but not including) the first `ref_time` whose `cnt_total` is the same as (or less than) that of the one before it.
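To make the rule concrete, here's a rough sketch (just illustrative, not something I have working) that puts each `cnt_total` next to the count of the next-newer `ref_time`; the first row whose count is not greater than its newer neighbour is where the "delete from here on down" region would start:

```sql
-- Illustrative only: compare each ref_time's count with the next-newer one.
select ref_time,
       count(*) as cnt_total,
       lag(count(*)) over (order by ref_time desc) as cnt_newer
from status_table
group by ref_time
order by ref_time desc;
```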
2 Answers
Here's a simple translation into window functions:

1. Ranks the `ref_time` rows same as before, keeping only the ones that are fully processed.
2. Lists these `ref_time`'s, plus the `cnt_total` of the preceding `ref_time`, one higher in this ranking.
3. Takes the most recent `ref_time` in this order and, if the second one's `cnt_total` is higher, takes it too.
4. Deletes the `ref_time`'s found in 1., unless they were picked in 3., returning everything it deleted, for you to inspect.

Demo with additional test cases.
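As a rough sketch of how those four steps might fit together (the actual query is in the linked demo; the CTE names `fully_processed`, `ranked` and `keep` are made up here, not taken from it):

```sql
with fully_processed as (            -- step 1: fully processed ref_time's
  select ref_time, count(*) as cnt_total
  from status_table
  group by ref_time
  having bool_and(processed)
), ranked as (                       -- step 2: pull in the preceding (newer) cnt_total
  select ref_time, cnt_total,
         lag(cnt_total) over (order by ref_time desc) as prev_cnt,
         row_number()   over (order by ref_time desc) as rn
  from fully_processed
), keep as (                         -- step 3: the most recent ref_time, plus the
  select ref_time                    --         second one if its count is higher
  from ranked
  where rn = 1
     or (rn = 2 and cnt_total > prev_cnt)
)
delete from status_table s           -- step 4: delete the rest, returning the rows
using fully_processed f
where s.ref_time = f.ref_time
and   f.ref_time not in (select ref_time from keep)
returning s.*;
```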
fiddle
Subquery `sub1` is mostly what you already had. Plus some more:

- We only need `count(*)`, since `HAVING bool_and(processed)` excludes other cases anyway.
- `lag(count(*)) OVER (ORDER BY ref_time DESC) >= count(*)` checks whether the previous (newer) row's count was the same or greater, in which case we'll call that a "drop". And we want to keep all rows before the first "drop".
- So `count(*) FILTER (WHERE drop) OVER (ORDER BY ref_time DESC) AS drops` in `sub2` counts those "drops", and `WHERE drops > 0` eliminates the leading rows we want to preserve from the set of rows to be deleted.

About this "gaps-and-islands" approach:
If there can be concurrent writes on `status_table`, you may need to add the locking clause `FOR UPDATE` to be sure. But you cannot do that in an aggregate query, so you would add another subquery where you do that … See:
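For reference, a minimal sketch assembling the pieces described above (`sub1`, `sub2`, the `lag()` comparison and the `drops` filter); the actual query is in the linked fiddle, and anything beyond those named pieces is an assumption:

```sql
-- Minimal sketch only; sub1/sub2 are the subquery names used in the explanation.
with sub1 as (
  select ref_time,
         lag(count(*)) over (order by ref_time desc) >= count(*) as drop  -- newer count >= current count
  from status_table
  group by ref_time
  having bool_and(processed)              -- only fully processed ref_time's
), sub2 as (
  select ref_time,
         count(*) filter (where drop) over (order by ref_time desc) as drops
  from sub1
)
delete from status_table s
using sub2
where s.ref_time = sub2.ref_time
and   sub2.drops > 0                      -- keep everything before the first "drop"
returning s.*;
```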