skip to Main Content

I thought I understood GENERATE_SERIES(), but obviously not as well as I thought.

I have a table:

CREATE TABLE status
(
  start_tz  TIMESTAMPTZ NOT NULL,
  end_tz    TIMESTAMPTZ NOT NULL,
  ms        TEXT        NOT NULL,
  sensor_id SMALLINT    NOT NULL,
);

And I wish to fill this table with simulated data from 2022-01-01 00:00:00 in 20 minute intervals for 4 sensors (numbered 1 – 4) for 2 hours (or whatever – no hardcoding). The ms text is random – see below.

I have tried various combinations of

SELECT
  GENERATE_SERIES('2022-01-01 00:00:00', '2022-01-01 02:00:00', (INTERVAL '20 MINUTES')),
  GENERATE_SERIES('2022-01-01 00:20:00', '2022-01-01 02:20:00', (INTERVAL '20 MINUTES')),
  CASE
     WHEN random() > 0.5 THEN 'in_motion)'
    ELSE                     'stationary'
  END,
  GENERATE_SERIES(1, 4);

and just can’t get it. I also have various cominations and permutations of FROM GENERATE_SERIES(1, ... whatever guess...); and I’m stumped.

What I want looks like this:

start_tz                 end_tz                 ms      sensor_id
2022-01-01 00:00:00+00  2022-01-01 00:20:00+00  stationary  1
2022-01-01 00:20:00+00  2022-01-01 00:40:00+00  stationary  1
2022-01-01 00:40:00+00  2022-01-01 01:00:00+00  in_motion)  1
2022-01-01 01:00:00+00  2022-01-01 01:20:00+00  in_motion)  1
2022-01-01 01:20:00+00  2022-01-01 01:40:00+00  stationary  1
2022-01-01 01:40:00+00  2022-01-01 02:00:00+00  in_motion)  1
2022-01-01 02:00:00+00  2022-01-01 02:20:00+00  stationary  1
2022-01-01 00:00:00+00  2022-01-01 00:20:00+00  stationary  2
2022-01-01 00:20:00+00  2022-01-01 00:40:00+00  stationary  2
2022-01-01 00:40:00+00  2022-01-01 01:00:00+00  in_motion)  2
...
...
and so on

A fiddle is available here. The correct answer will give some explanation of where I was going wrong – I don’t want to be given a fish, I would like to be taught how to do a little fishing of my own!

2

Answers


  1. The behavior you want would have been produced in 9.6, but perhaps only by accident. Back then multiple SRF in the select list would be cycled until all of them were exhausted at the same time, returning a number of rows which was the Least Common Multiple of each isolated row counts. Because 7 is prime, this would do what you apparently want. This was changed in v10, as a known side effect of other refactoring I think, so that each SRF was cycle just once with the "short" ones being augmented with NULLs.

    To get what you want you can put one of the SRF into the FROM list:

    SELECT
      GENERATE_SERIES('2022-01-01 00:00:00', '2022-01-01 02:00:00', (INTERVAL '20 MINUTES')),
      GENERATE_SERIES('2022-01-01 00:20:00', '2022-01-01 02:20:00', (INTERVAL '20 MINUTES')),
      CASE
         WHEN random() > 0.5 THEN 'in_motion)'
        ELSE                     'stationary'
      END, 
      f 
      FROM 
      GENERATE_SERIES(1, 4) f(f);
    

    But I think it is still discouraged to put SRF in the SELECT-list. The problem is that putting both of the time ones in the FROM-list would return 7*7*4 rows, not LCM(7,7,4) rows. To get around that, you would reduce it to one SRF for the time, and use an expression to get the parallel offset time point.

    SELECT
          s,
          s+INTERVAL '20 MINUTES',
          CASE
             WHEN random() > 0.5 THEN 'in_motion)'
            ELSE                     'stationary'
          END, 
          f 
          FROM GENERATE_SERIES('2022-01-01 00:00:00', '2022-01-01 02:00:00', (INTERVAL '20 MINUTES')) s(s), 
          GENERATE_SERIES(1, 4) f(f);
    

    So this is the way I would do it. It is also less repetitive, repeating only the interval rather than the offsat start, offsat stop, and interval.

    Login or Signup to reply.
    • GENERATE_SERIES returns a table, so to get rows from it you should only use it in the FROM clause of a query, not the SELECT part.
      • The generated table has a single column named "generate_series" – because that’s also the name of the function itself it’s a good idea to escape the column name (with double-quotes) to avoid potential trouble.
    • Because your end_tz column values are always 20-minutes ahead of start_tz you don’t (and shouldn’t) use GENERATE_SERIES again, instead define a new query column that simply adds 20 minutes to the already-generated start_tz value.
    • Multiplying the generated rows for each sensor_id can be done by re-querying that generated data 4 times over with UNION.
    • Finally, add the 'in_motion'/stationary' column in the last query step.

    Like so:

    WITH twentyMinuteIntervals AS (
        SELECT
            "generate_series" AS d0,
            ( "generate_series" + INTERVAL '20 MINUTES' ) AS d1
        FROM
            GENERATE_SERIES( '2022-01-01 00:00:00', '2022-01-01 02:00:00', INTERVAL '20 MINUTES' )
    ),
    multiplyRowsBy4 AS (
        SELECT d0, d1, 1 AS "sensor_id" FROM twentyMinuteIntervals
        UNION
        SELECT d0, d1, 2 AS "sensor_id" FROM twentyMinuteIntervals
        UNION
        SELECT d0, d1, 3 AS "sensor_id" FROM twentyMinuteIntervals
        UNION
        SELECT d0, d1, 4 AS "sensor_id" FROM twentyMinuteIntervals
    ),
    withRandom AS (
        SELECT
            d0 AS "start_tz",
            d1 AS "end_tz",
            "sensor_id",
            CASE WHEN random() > 0.5 THEN 'in_motion' ELSE 'stationary' END AS "status"
        FROM
            multiplyRowsBy4
    )
    SELECT
        *
    FROM
        withRandom
    ORDER BY
        "start_tz",
        "sensor_id";
    

    Here’s the DBFiddle, screenshot below:

    enter image description here

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search