
I have a table logs_bl_sj, which is ordered by startdate:

bundesland | startdate           | enddate
-----------+---------------------+--------------------
Hessen     | 2015-02-26 16:22:21 | 2015-02-26 16:31:31
Hessen     | 2015-10-20 22:34:54 | 2015-10-20 22:35:03
Bremen     | 2015-10-20 22:35:50 | 2015-10-20 22:37:03

I want to find for each row r, how many rows x exist in this table where:

x.startdate <= r.startdate and r.startdate < x.enddate and r.bundesland = x.bundesland

In other words, for each startdate s, I want to find the number of timeranges [a, b) containing s with the same value for bundesland (which is always at least one: s is always contained in [s, b)).
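In plain SQL, that definition corresponds to this correlated subquery (shown just to pin down what I'm counting; it rechecks every row of the table for each row, which is exactly what I'd like to avoid):

-- For each row r, count the rows x of the same bundesland whose
-- range [startdate, enddate) contains r.startdate.
SELECT r.bundesland, r.startdate,
       (SELECT COUNT(*)
        FROM logs_bl_sj x
        WHERE x.bundesland = r.bundesland
          AND x.startdate <= r.startdate
          AND r.startdate  < x.enddate) AS cnt
FROM logs_bl_sj r;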

Notice how useful the order of the table is: for each row, rows that come after this row will not count and so should not even be checked.

How can I make use of this fact in PostgreSQL? That is, how can I make the server ignore all rows after each row when calculating that row's count?

I am close to a query that gets the right data, though without the optimization mentioned above. Here's what I have:

SELECT bundesland,
       startdate,
       COUNT(time_range) FILTER (WHERE time_range @> startdate::timestamp)
           OVER (PARTITION BY bundesland)
FROM logs_bl_sj_timerange

where logs_bl_sj_timerange is logs_bl_sj from above, but with an added column time_range, a tsrange holding [startdate, enddate).
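For concreteness, the derived table could have been built along these lines (illustrative DDL; the ::timestamp casts mirror the cast in my query above):

-- Same rows as logs_bl_sj, plus a tsrange covering [startdate, enddate).
CREATE TABLE logs_bl_sj_timerange AS
SELECT bundesland,
       startdate,
       enddate,
       tsrange(startdate::timestamp, enddate::timestamp, '[)') AS time_range
FROM logs_bl_sj;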

The COUNT just returns the total number of time_ranges in the bundesland, whereas I expected the number of time ranges in that bundesland that contain startdate.

Bonus question: Would it be better to do this procedurally, e.g. in Python? Iterating over the ordered startdates, one could keep a running count that is updated from a stored array of enddates, whereas PostgreSQL has to start a fresh count for each row.

2 Answers


  1. You can solve this with window functions and appropriate sorting, without iterating in Python. Since your table is already sorted by startdate, you can use the LAG window function to obtain the enddate of the previous row and then check whether the startdate of the current row falls within that range.
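    A minimal sketch of that idea (table name taken from the question; note that LAG only sees the single immediately preceding row per bundesland, so this detects an overlap with that one row rather than counting every containing range):

    -- 1 for the row's own range, plus 1 if the previous range in the
    -- same bundesland still covers this startdate. COALESCE handles
    -- the first row of each group, where LAG returns NULL.
    SELECT bundesland,
           startdate,
           1 + COALESCE((startdate < LAG(enddate) OVER w)::int, 0) AS cnt
    FROM logs_bl_sj
    WINDOW w AS (PARTITION BY bundesland ORDER BY startdate);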
    As for your bonus question: in general, SQL is more efficient than iterating in Python when its built-in optimizations and features can be used, because database systems such as PostgreSQL are highly optimized for exactly this kind of set-based operation. It does depend on the specifics, including data volume, database configuration, and indexing, but in this case SQL is the more appropriate choice.

  2. There are multiple ways to solve this, and a lateral join is one of them. Its effectiveness depends on the data (how selective a given bundesland + startdate combination is):

    -- For every row s1, count the rows s2 in the same bundesland whose
    -- range [startdate, enddate) contains s1.startdate.
    select *
    from sampleData s1
    cross join lateral (
        select count(*)
        from sampleData s2
        where s1.bundesland = s2.bundesland
          and s1.startdate >= s2.startdate
          and s1.startdate <  s2.enddate
    ) as cnt;
    

    As for the bonus question, I don't think Python would be faster; you can try it yourself on a large data set. And if you do have a large data set, doing this with an extension such as TimescaleDB might be much better.

    DBFiddle demo

    EDIT: Note that PostgreSQL would make good use of an index on startdate here.
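    For example (single-column index as mentioned; whether a composite (bundesland, startdate) index would serve the lateral lookup even better is an assumption worth testing on your data):

    -- Supports the s2.startdate comparisons in the lateral subquery.
    create index sampledata_startdate_idx on sampleData (startdate);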
