I have a table `logs_bl_sj`, which is ordered by `startdate`:
| bundesland | startdate | enddate |
|---|---|---|
| 'Hessen' | 2015-02-26 16:22:21 | 2015-02-26 16:31:31 |
| 'Hessen' | 2015-10-20 22:34:54 | 2015-10-20 22:35:03 |
| 'Bremen' | 2015-10-20 22:35:50 | 2015-10-20 22:37:03 |
| … | | |
I want to find, for each row `r`, how many rows `x` exist in this table where:

```
x.startdate <= r.startdate
AND r.startdate < x.enddate
AND r.bundesland = x.bundesland
```
In other words, for each `startdate` `s`, I want to find the number of time ranges `[a, b)` containing `s` with the same value for `bundesland` (which is always at least one: `s` is always contained in `[s, b)`).
Notice how useful the order of the table is: for each row, rows that come after this row will not count and so should not even be checked.
How can I make use of this fact in PostgreSQL? That is, how can I make the server ignore all rows that come after each row when calculating the count for that row?
I am close to a query that gets the right data, though without the optimization mentioned above. Here's what I have:
```
SELECT bundesland,
       startdate,
       COUNT(time_range) FILTER (WHERE time_range @> startdate::timestamp)
           OVER (PARTITION BY bundesland)
FROM logs_bl_sj_timerange
```
where `logs_bl_sj_timerange` is `logs_bl_sj` from above, but with an added column `time_range` that is just the `tsrange` `[startdate, enddate)`.
The `COUNT` just returns the number of `time_range`s in the `bundesland`… I expected the number of time ranges in that `bundesland` which contain the row's `startdate`.
Bonus question: Would it be better to do this procedurally, e.g. in Python? Iterating over the ordered startdates, one could keep a running count that changes based on a stored array of enddates… whereas PostgreSQL will have to start a fresh count for each row.
2 Answers
You can solve this problem by using window functions and appropriate sorting, without the need for iteration in Python. Since your table is already sorted by startdate, you can use the LAG window function to obtain the enddate of the previous row and then check if the startdate of the current row is within this range.
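A minimal sketch of that idea, assuming the `logs_bl_sj` table from the question; note that `LAG` looks only one row back per `bundesland`, so this counts the row's own range plus at most the immediate predecessor's:

```
-- Sketch, not a drop-in answer: 1 for the row's own range, plus 1 if the
-- immediately preceding range (same bundesland, ordered by startdate) is
-- still open at this row's startdate. LAG() never sees earlier ranges.
SELECT bundesland,
       startdate,
       1 + COALESCE((startdate < LAG(enddate) OVER w)::int, 0) AS cnt
FROM logs_bl_sj
WINDOW w AS (PARTITION BY bundesland ORDER BY startdate);
```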
As for your additional questions, in general, using SQL (especially when its built-in optimizations and features can be utilized) is more efficient than iterating in Python. This is because database management systems (such as PostgreSQL) are often highly optimized for this type of data operation. Of course, this also depends on the specific situation, including data volume, database configuration, and indexing. In this specific situation, I believe using SQL is a more appropriate choice.
There are multiple ways to solve this, and using `LATERAL` is one of them. Its effectiveness depends on the data (how selective a given `bundesland` + `startdate` combination is):
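A sketch of the lateral approach, assuming the original `logs_bl_sj` table; for each row `r` the subquery counts the rows `x` satisfying exactly the three conditions from the question:

```
-- For each row r, count the ranges in the same bundesland whose
-- half-open interval [startdate, enddate) contains r.startdate.
SELECT r.bundesland, r.startdate, c.cnt
FROM logs_bl_sj r
CROSS JOIN LATERAL (
    SELECT count(*) AS cnt
    FROM logs_bl_sj x
    WHERE x.bundesland = r.bundesland
      AND x.startdate <= r.startdate
      AND r.startdate <  x.enddate
) c;
```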
As for the bonus question, I don't think Python would be faster. You can try it yourself on a large data set. And if you do have a large data set, doing this with some extension might be much better (e.g. TimescaleDB).
DBFiddle demo
EDIT: Note that PostgreSQL would utilize an index on `startdate` pretty well.
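For example, a hypothetical composite index (name assumed) that covers both the `bundesland` equality and the `startdate` condition in the lateral subquery:

```
-- Hypothetical supporting index: bundesland first for the equality test,
-- startdate second for the range predicate.
CREATE INDEX logs_bl_sj_bl_start_idx ON logs_bl_sj (bundesland, startdate);
```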