In Polars / pandas I can do a rolling sum where, for each row, the window is (row - 10 minutes, row]. For example:
import polars as pl

data = {
    "timestamp": [
        "2023-08-04 10:00:00",
        "2023-08-04 10:05:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:20:00",
        "2023-08-04 10:20:00",
    ],
    "value": [1, 2, 3, 4, 5, 6],
}
df = pl.DataFrame(data).with_columns(pl.col("timestamp").str.strptime(pl.Datetime))
print(
    df.with_columns(pl.col("value").rolling_sum_by("timestamp", "10m", closed="right"))
)
This outputs
shape: (6, 2)
┌─────────────────────┬───────┐
│ timestamp ┆ value │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪═══════╡
│ 2023-08-04 10:00:00 ┆ 1 │
│ 2023-08-04 10:05:00 ┆ 3 │
│ 2023-08-04 10:10:00 ┆ 9 │
│ 2023-08-04 10:10:00 ┆ 9 │
│ 2023-08-04 10:20:00 ┆ 11 │
│ 2023-08-04 10:20:00 ┆ 11 │
└─────────────────────┴───────┘
How can I do this in DuckDB? The closest I could come up with is:
rel = duckdb.sql("""
    SELECT
        timestamp,
        value,
        SUM(value) OVER roll AS rolling_sum
    FROM df
    WINDOW roll AS (
        ORDER BY timestamp
        RANGE BETWEEN INTERVAL 10 minutes PRECEDING AND CURRENT ROW
    )
    ORDER BY timestamp;
""")
print(rel)
but that makes the window [row - 10 minutes, row], not (row - 10 minutes, row].
Alternatively, I could do
rel = duckdb.sql("""
    SELECT
        timestamp,
        value,
        SUM(value) OVER roll AS rolling_sum
    FROM df
    WINDOW roll AS (
        ORDER BY timestamp
        RANGE BETWEEN INTERVAL '10 minutes' - INTERVAL '1 microsecond' PRECEDING AND CURRENT ROW
    )
    ORDER BY timestamp;
""")
but I’m not sure how robust that would be.
2 Answers
Maybe not particularly neat, but off the top of my head you could exclude the rows that sit exactly 10 minutes back with an additional window clause.
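A minimal sketch of one way to read that suggestion, assuming the same df as in the question: keep the closed window from the first attempt and subtract a second windowed sum whose frame holds only the rows sitting exactly 10 minutes back. The second window (named boundary here) and the COALESCE fallback for empty frames are my additions, not the answerer's code.

import duckdb

rel = duckdb.sql("""
    SELECT
        timestamp,
        value,
        -- closed window [row - 10 minutes, row] minus the rows at exactly row - 10 minutes
        SUM(value) OVER closed_roll
            - COALESCE(SUM(value) OVER boundary, 0) AS rolling_sum
    FROM df
    WINDOW
        closed_roll AS (
            ORDER BY timestamp
            RANGE BETWEEN INTERVAL '10 minutes' PRECEDING AND CURRENT ROW
        ),
        boundary AS (
            -- frame containing only the rows whose timestamp equals row - 10 minutes;
            -- SUM over an empty frame is NULL, hence the COALESCE above
            ORDER BY timestamp
            RANGE BETWEEN INTERVAL '10 minutes' PRECEDING AND INTERVAL '10 minutes' PRECEDING
        )
    ORDER BY timestamp;
""")
print(rel)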
DuckDB window operator maintainer here. Your second solution will do what you want, and it is unfortunate that the standard does not have a way to specify this (although it sounds like a useful extension). Possibly of interest, we do support EXCLUDE CURRENT ROW, which would give you a half-open interval on the right; a sketch follows below.

I’m not sure what you mean by "robust", but I’m guessing you are concerned about timestamp accuracy changing under you, and that should not be an issue. We do have multiple precision types, but they will all get converted internally to µs precision for windowing operations. Changing the internal timestamp precision would invalidate every database out there, and that is not going to happen.
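For reference, a hedged sketch of what that frame exclusion looks like on a recent DuckDB version, again using the df from the question. Note that EXCLUDE CURRENT ROW removes only the current row itself (not its timestamp peers) from the frame, so this yields an interval open on the right rather than the left-open one asked about.

rel = duckdb.sql("""
    SELECT
        timestamp,
        value,
        SUM(value) OVER (
            ORDER BY timestamp
            RANGE BETWEEN INTERVAL '10 minutes' PRECEDING AND CURRENT ROW
            EXCLUDE CURRENT ROW
        ) AS rolling_sum_excl
    FROM df
    ORDER BY timestamp;
""")
print(rel)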
For performance, there should also be no difference as they both use constants that get evaluated at query compile time. On the other hand, adding a second window function will definitely be twice as slow.