Data Definition
I have a status table like so, attributes not relevant to the question are omitted:
id | created | value |
---|---|---|
1 | 2024-06-24T13:01:00 | error |
2 | 2024-06-24T13:02:00 | ok |
3 | 2024-06-24T13:03:00 | warning |
4 | 2024-06-24T13:04:00 | error |
5 | 2024-06-24T13:05:00 | error |
6 | 2024-06-24T13:05:30 | error |
7 | 2024-06-24T13:06:00 | ok |
8 | 2024-06-24T13:07:00 | error |
9 | 2024-06-24T13:07:30 | error |
10 | 2024-06-24T13:08:00 | warning |
11 | 2024-06-24T13:09:00 | error |
Task at Hand
I’d like to collapse the table into a bloc-like view, where the "error" blocs (1
, 4-6
, 8-9
, 11
) are collapsed into a single line, but with the respective before and after states and timestamps also included like so:
error_first_occurance | value_before | timestamp_before | value_after | timestamp_after |
---|---|---|---|---|
2024-06-24T13:01:00 | NULL | NULL | ok | 2024-06-24T13:02:00 |
2024-06-24T13:04:00 | warning | 2024-06-24T13:03:00 | ok | 2024-06-24T13:06:00 |
2024-06-24T13:07:00 | ok | 2024-06-24T13:06:00 | warning | 2024-06-24T13:08:00 |
2024-06-24T13:09:00 | warning | 2024-06-24T13:08:00 | NULL | NULL |
Solution Options
As far as I know, I’d have the following options:
1. Create sub queries
SELECT
value AS "value_before"
-- created AS "timestamp_before"
FROM t AS t1
WHERE t1.value != 'error' AND t1.created < t.created
ORDER BY t1.created DESC
LIMIT 1
SELECT
value AS "value_after"
-- created AS "timestamp_after"
FROM t AS t2
WHERE t2.value != 'error' AND t2.created > t.created
ORDER BY t2.created ASC
LIMIT 1
2. LATERAL JOIN
Using a lateral JOIN would presumably save half of the queries, as I could extract two fields at the same time (value
, created
) which is not possible with the sub queries.
However, the engine would likely need to execute the sorting again for each row that the main query produces. Therefore, I have not pursued this further.
3. SELF JOINs
Here, I’d create two derived tables t1
and t2
(CTE in this case would be the same thing), with sorting by created
DESC
(newest first) and ASC
(oldest first) and join them on t1.value != 'error' AND t1.created < t.created
for the "newest that is not an error and older that the t.created", and on t2.value != 'error' AND t2.created > t.created
for the "oldest that is not an error and younger that the t.created".
SELECT
t.created "error_first_occurance",
t1.value "value_before",
t1.created "timestamp_before",
t2.value "value_after",
t2.created "timestamp_after"
FROM t LEFT
JOIN (
SELECT value, created
FROM t WHERE value != 'error'
ORDER BY created DESC
) t1 ON t1.created < t.created LEFT
JOIN (
SELECT value, created
FROM t WHERE value != 'error'
ORDER BY created ASC
) t2 ON t2.created > t.created
WHERE
t.value = 'error'
ORDER BY
t.created
This already yields the correct "superset", but as I cannot rely on t1.value
or t2.value
having only the same values between blocs (see line 2
–3
), I don’t know how to tell the DBMS that I want the MAX()
/MIN()
value from the created
timestamp, and the same value
from that record.
Question
The challenge seems to be that I not only need the created
timestamp field, which could be easily extracted with aggregation functions, but I also need the string value
from the same record, which is not accessible to the aggregation functions.
Therefore, on a presumed real table with many 100ks of records:
- What would be the preferred SQL code to generate the result table?
- Which indices would I need to create to have optimum speed.
2
Answers
Each error block requires information from three rows, so the most straight-forward way is to use two steps where you each combine two. You can do this either with self-joins or with window functions such as
lag()
. The rest is just getting the rows right. Here’s an implementation with self-joins, which seems preferable to me (more concise):Note the use of
IS DISTINCT
andCOALESCE
to deals with theNULL
s.I don’t think there will be much of a difference performance-wise between this and similar approaches (such as using window functions), but if it’s important enough, you should definitely try some variations, sometimes the PostgreSQL planner gets weird ideas.
A different approach would be the to use a recursive query, but I don’t think that would get you anything in this case (except even more obscure syntax). They are, however, the most general and powerful tool to deal with this kind of problems (aggregating over groups of sequential rows), so if you keep running into something like this, you might want to have a look.
Concerning indexes, I assume that
id
is the primary key, so the most important index is already there. Other than that, a multicolumn index onid
andvalue
might help speed up the initial self-join, but only if your error and error-free blocks are rather large, as otherwise most of the rows of the table will have to be scanned anyway.That’s a gaps-and-island problem. As already pointed out by @Knoep, you can use window functions:
It runs through your table finding islands (uninterrupted sequences of
error
) and collects values from their preceding and following rows. After that, it takes one set of first/last values per island usingdistinct on
.It’s about as fast as the quadruple self-
join
but it doesn’t have to rely on the two quiet assumptions that theid
column has no gaps and that it follows the same order ascreated
– that one fails if any of the two doesn’t hold. The window-over-window also doesn’t require the additional effort of maintaining the pre-sorted, gaplessid
.