
In Polars / pandas I can do a rolling sum where, for each row, the window is (row - 10 minutes, row]. For example:

import polars as pl

data = {
    "timestamp": [
        "2023-08-04 10:00:00",
        "2023-08-04 10:05:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:20:00",
        "2023-08-04 10:20:00",
    ],
    "value": [1, 2, 3, 4, 5, 6],
}

df = pl.DataFrame(data).with_columns(pl.col("timestamp").str.strptime(pl.Datetime))

print(
    df.with_columns(pl.col("value").rolling_sum_by("timestamp", "10m", closed="right"))
)

This outputs

shape: (6, 2)
┌─────────────────────┬───────┐
│ timestamp           ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2023-08-04 10:00:00 ┆ 1     │
│ 2023-08-04 10:05:00 ┆ 3     │
│ 2023-08-04 10:10:00 ┆ 9     │
│ 2023-08-04 10:10:00 ┆ 9     │
│ 2023-08-04 10:20:00 ┆ 11    │
│ 2023-08-04 10:20:00 ┆ 11    │
└─────────────────────┴───────┘

How can I do this in DuckDB? The closest I could come up with is:

import duckdb

rel = duckdb.sql("""
SELECT
  timestamp,
  value,
  SUM(value) OVER roll AS rolling_sum
FROM df
WINDOW roll AS (
  ORDER BY timestamp
  RANGE BETWEEN INTERVAL 10 minutes PRECEDING AND CURRENT ROW
)
ORDER BY timestamp;
""")
print(rel)

but that makes the window [row - 10 minutes, row], not (row - 10 minutes, row]. For example, at 10:10:00 the inclusive frame would also pick up the row at exactly 10:00:00, giving 10 instead of the expected 9.

Alternatively, I could do

rel = duckdb.sql("""
SELECT
  timestamp,
  value,
  SUM(value) OVER roll AS rolling_sum
FROM df
WINDOW roll AS (
  ORDER BY timestamp
  RANGE BETWEEN INTERVAL '10 minutes' - INTERVAL '1 microsecond' PRECEDING AND CURRENT ROW
)
ORDER BY timestamp;
""")

but I'm not sure how robust that would be.

2 Answers


  1. Maybe not particularly neat, but off the top of my head: you could subtract the rows that sit exactly 10 minutes back using an additional window clause:

    import duckdb
    
    rel = duckdb.sql("""
    SELECT
      timestamp,
      value,
      SUM(value) OVER roll - coalesce(SUM(value) OVER exclude, 0) AS rolling_sum
    FROM df
    WINDOW roll AS (
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL 10 minutes PRECEDING AND CURRENT ROW
    ), exclude AS (
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL 10 minutes PRECEDING AND INTERVAL 10 minutes PRECEDING
    )
    ORDER BY timestamp;
    """)
    print(rel)
    
    ┌─────────────────────┬───────┬─────────────┐
    │      timestamp      │ value │ rolling_sum │
    │      timestamp      │ int64 │   int128    │
    ├─────────────────────┼───────┼─────────────┤
    │ 2023-08-04 10:00:00 │     1 │           1 │
    │ 2023-08-04 10:05:00 │     2 │           3 │
    │ 2023-08-04 10:10:00 │     3 │           9 │
    │ 2023-08-04 10:10:00 │     4 │           9 │
    │ 2023-08-04 10:20:00 │     5 │          11 │
    │ 2023-08-04 10:20:00 │     6 │          11 │
    └─────────────────────┴───────┴─────────────┘
    
  2. DuckDB window operator maintainer here. Your second solution will do what you want, and it is unfortunate that the standard does not have a way to specify this (although it sounds like a useful extension). Possibly of interest, we do support EXCLUDE CURRENT ROW, which would give you a half-open interval on the right.
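
    For reference, the EXCLUDE clause attaches to the end of the frame specification. A minimal sketch of what that would look like here (my adaptation, not from the answer; note that EXCLUDE CURRENT ROW removes only the current row itself, so peers sharing its exact timestamp stay in the frame):

    import duckdb

    # Same query as in the question, with a frame-exclusion clause appended
    # to the window definition. EXCLUDE CURRENT ROW drops the current row
    # from its own frame; rows with an identical timestamp remain included.
    rel = duckdb.sql("""
    SELECT
      timestamp,
      value,
      SUM(value) OVER roll AS rolling_sum
    FROM df
    WINDOW roll AS (
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL 10 minutes PRECEDING AND CURRENT ROW
      EXCLUDE CURRENT ROW
    )
    ORDER BY timestamp;
    """)
    print(rel)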

    I'm not sure what you mean by "robust", but I'm guessing you are concerned about the timestamp precision changing out from under you, and that should not be an issue. We do have multiple precision types, but they all get converted internally to µs precision for windowing operations. Changing the internal timestamp precision would invalidate every database out there, and that is not going to happen.
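
    One way to convince yourself of the precision point on the example frame: cast the column to DuckDB's nanosecond type and rerun the question's second query. This is only a sketch, under the assumption that TIMESTAMP_NS is accepted in a RANGE frame (which the internal conversion described above implies); the result should match the microsecond version.

    import duckdb

    # Cast to TIMESTAMP_NS in a subquery, then apply the same
    # '10 minutes minus 1 microsecond' frame from the question.
    rel = duckdb.sql("""
    SELECT
      timestamp,
      value,
      SUM(value) OVER roll AS rolling_sum
    FROM (SELECT CAST(timestamp AS TIMESTAMP_NS) AS timestamp, value FROM df)
    WINDOW roll AS (
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL '10 minutes' - INTERVAL '1 microsecond' PRECEDING
        AND CURRENT ROW
    )
    ORDER BY timestamp;
    """)
    print(rel)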

    For performance, there should also be no difference, as both approaches use constants that get evaluated at query compile time. On the other hand, adding a second window function will definitely be twice as slow.
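
    If you want to check the relative cost on your own data, a quick (and admittedly crude) timing sketch along these lines would do; the two queries are taken from the question and the first answer, and on the six-row example frame the numbers will be pure noise:

    import time

    import duckdb

    single_window = """
    SELECT SUM(value) OVER (
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL '10 minutes' - INTERVAL '1 microsecond' PRECEDING
        AND CURRENT ROW
    ) FROM df
    """

    two_windows = """
    SELECT SUM(value) OVER roll - coalesce(SUM(value) OVER exclude, 0)
    FROM df
    WINDOW roll AS (
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL 10 minutes PRECEDING AND CURRENT ROW
    ), exclude AS (
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL 10 minutes PRECEDING
        AND INTERVAL 10 minutes PRECEDING
    )
    """

    for name, query in [("one window", single_window), ("two windows", two_windows)]:
        start = time.perf_counter()
        duckdb.sql(query).fetchall()  # fetchall() forces full execution
        print(name, time.perf_counter() - start)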
