skip to Main Content

Data Definition

I have a status table like so, attributes not relevant to the question are omitted:

id created value
1 2024-06-24T13:01:00 error
2 2024-06-24T13:02:00 ok
3 2024-06-24T13:03:00 warning
4 2024-06-24T13:04:00 error
5 2024-06-24T13:05:00 error
6 2024-06-24T13:05:30 error
7 2024-06-24T13:06:00 ok
8 2024-06-24T13:07:00 error
9 2024-06-24T13:07:30 error
10 2024-06-24T13:08:00 warning
11 2024-06-24T13:09:00 error

Task at Hand

I’d like to collapse the table into a bloc-like view, where the "error" blocs (1, 4-6, 8-9, 11) are collapsed into a single line, but with the respective before and after states and timestamps also included like so:

error_first_occurance value_before timestamp_before value_after timestamp_after
2024-06-24T13:01:00 NULL NULL ok 2024-06-24T13:02:00
2024-06-24T13:04:00 warning 2024-06-24T13:03:00 ok 2024-06-24T13:06:00
2024-06-24T13:07:00 ok 2024-06-24T13:06:00 warning 2024-06-24T13:08:00
2024-06-24T13:09:00 warning 2024-06-24T13:08:00 NULL NULL

Solution Options

As far as I know, I’d have the following options:

1. Create sub queries

SELECT 
  value AS "value_before"
  -- created AS "timestamp_before"
FROM t AS t1 
WHERE t1.value != 'error' AND t1.created < t.created 
ORDER BY t1.created DESC 
LIMIT 1
SELECT 
  value AS "value_after"
  -- created AS "timestamp_after"
FROM t AS t2 
WHERE t2.value != 'error' AND t2.created > t.created 
ORDER BY t2.created ASC
LIMIT 1

2. LATERAL JOIN

Using a lateral JOIN would presumably save half of the queries, as I could extract two fields at the same time (value, created) which is not possible with the sub queries.

However, the engine would likely need to execute the sorting again for each row that the main query produces. Therefore, I have not pursued this further.

3. SELF JOINs

Here, I’d create two derived tables t1 and t2 (CTE in this case would be the same thing), with sorting by created DESC (newest first) and ASC (oldest first) and join them on t1.value != 'error' AND t1.created < t.created for the "newest that is not an error and older that the t.created", and on t2.value != 'error' AND t2.created > t.created for the "oldest that is not an error and younger that the t.created".

SELECT
  t.created  "error_first_occurance",
  t1.value   "value_before",
  t1.created "timestamp_before",
  t2.value   "value_after",
  t2.created "timestamp_after"
FROM t LEFT
JOIN (
  SELECT value, created
  FROM t WHERE value != 'error'
  ORDER BY created DESC
) t1 ON t1.created < t.created LEFT
JOIN (
  SELECT value, created
  FROM t WHERE value != 'error'
  ORDER BY created ASC
) t2 ON t2.created > t.created
WHERE 
  t.value = 'error'
ORDER BY 
  t.created

This already yields the correct "superset", but as I cannot rely on t1.value or t2.value having only the same values between blocs (see line 23), I don’t know how to tell the DBMS that I want the MAX()/MIN() value from the created timestamp, and the same value from that record.

Question

The challenge seems to be that I not only need the created timestamp field, which could be easily extracted with aggregation functions, but I also need the string value from the same record, which is not accessible to the aggregation functions.

Therefore, on a presumed real table with many 100ks of records:

  • What would be the preferred SQL code to generate the result table?
  • Which indices would I need to create to have optimum speed.

2

Answers


  1. Each error block requires information from three rows, so the most straight-forward way is to use two steps where you each combine two. You can do this either with self-joins or with window functions such as lag(). The rest is just getting the rows right. Here’s an implementation with self-joins, which seems preferable to me (more concise):

    WITH a AS (
        SELECT 
            ROW_NUMBER() OVER (ORDER BY COALESCE(t1.id, -1)),
            t1.value AS value_before,
            t1.created AS timestamp_before,
            t2.created AS error_first_occurrence,
            t2.value,
            t2.created
        FROM
            t t1
            FULL JOIN t t2 ON t1.id+1 = t2.id
        WHERE
            (t1.value = 'error' AND t2.value IS DISTINCT FROM 'error')
            OR (t1.value IS DISTINCT FROM 'error' AND t2.value = 'error')
    )
    SELECT
        a1.error_first_occurrence,
        a1.value_before,
        a1.timestamp_before,
        a2.value AS value_after,
        a2.created AS timestamp_after
    FROM
        a a1
        JOIN a a2 ON a1.row_number+1 = a2.row_number
    WHERE
        a1.value = 'error';
    

    Note the use of IS DISTINCT and COALESCE to deals with the NULLs.

    I don’t think there will be much of a difference performance-wise between this and similar approaches (such as using window functions), but if it’s important enough, you should definitely try some variations, sometimes the PostgreSQL planner gets weird ideas.

    A different approach would be the to use a recursive query, but I don’t think that would get you anything in this case (except even more obscure syntax). They are, however, the most general and powerful tool to deal with this kind of problems (aggregating over groups of sequential rows), so if you keep running into something like this, you might want to have a look.

    Concerning indexes, I assume that id is the primary key, so the most important index is already there. Other than that, a multicolumn index on id and value might help speed up the initial self-join, but only if your error and error-free blocks are rather large, as otherwise most of the rows of the table will have to be scanned anyway.

    Login or Signup to reply.
  2. That’s a gaps-and-island problem. As already pointed out by @Knoep, you can use window functions:

    select distinct on(island_n)
           first_value(created)         over w2 as error_first_occurance
          ,first_value(value_before)    over w2 as value_before
          ,first_value(timestamp_before)over w2 as timestamp_before
          ,last_value(value_after)      over w2 as value_after
          ,last_value(timestamp_after)  over w2 as timestamp_after
    from(select*,count(*)filter(where value<>'error')over w1 as island_n 
                ,value='error'
                 and value is distinct from lag(value)over w1 as is_starting_island
                ,lag(value)   over w1 as value_before
                ,lag(created) over w1 as timestamp_before
                ,lead(value)  over w1 as value_after
                ,lead(created)over w1 as timestamp_after
         from t window w1 as(order by created) )_
    where value='error'
    window w2 as(partition by island_n order by created
                 rows between unbounded preceding and unbounded following)
    order by island_n,1;
    

    It runs through your table finding islands (uninterrupted sequences of error) and collects values from their preceding and following rows. After that, it takes one set of first/last values per island using distinct on.

    It’s about as fast as the quadruple self-join but it doesn’t have to rely on the two quiet assumptions that the id column has no gaps and that it follows the same order as createdthat one fails if any of the two doesn’t hold. The window-over-window also doesn’t require the additional effort of maintaining the pre-sorted, gapless id.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search