skip to Main Content

I’m working with event sourced data where all of the important fields are combined to a JSONB column and many of the keys are missing from most of the database rows.

I would want to get:

  1. Aggregated combined values of included arrays in the JSONB field (see ingredients in the example)
  2. Latest non-null value according to timestamp

I tried this on my own and I was able to produce a example which generates exactly what I would want to achieve here but it looks rather ugly and I’m wondering how to make following query better.

Schema (PostgreSQL v15)

CREATE TABLE events (
    id SERIAL PRIMARY KEY,
    identifier VARCHAR(255),
    timestamp TIMESTAMP WITH TIME ZONE,
    event_data JSONB
);
INSERT INTO events (identifier, timestamp, event_data)
VALUES
    ('12345', '2019-01-01T00:00:00.000Z', '{"target": "99999999"}'),
    ('12345', '2019-01-01T12:00:00.000Z', '{"ingredients": ["Banana", "Strawberry"]}'),
    ('12345', '2019-01-03T00:00:00.000Z', '{"target": "12345678", "user": "peterpan"}'),
    ('12345', '2019-01-04T00:00:00.000Z', '{"ingredients": ["Melon"], "user": "robinhood"}'),
    ('67890', '2019-01-03T00:00:00.000Z', '{"target": "0000", "user": "mickeymouse"}'),
    ('67890', '2019-01-04T00:00:00.000Z', '{"ingredients": ["Potato"]}');

Query #1

WITH events_flattened AS (
    SELECT
        identifier,
        timestamp,
        event_data->>'target' AS target,
        event_data->>'user' AS user,
        elem.part
    FROM events
    LEFT JOIN LATERAL jsonb_array_elements(event_data->'ingredients') elem(part) ON true
    ORDER BY timestamp DESC
)

SELECT
    identifier,
    (ARRAY_REMOVE(ARRAY_AGG(e.target),NULL))[1] as target,
    (ARRAY_REMOVE(ARRAY_AGG(e.user),NULL))[1] as user,
    ARRAY_REMOVE(ARRAY_AGG(part),NULL) as ingredients,
    MAX(timestamp) as latest_update
FROM events_flattened e
GROUP BY identifier;

For the answer to be helpful it should produce exactly same result as seen on this table:

identifier target user ingredients latest_update
12345 12345678 robinhood Melon,Strawberry,Banana 2019-01-04T00:00:00.000Z
67890 0000 mickeymouse Potato 2019-01-04T00:00:00.000Z

View on DB Fiddle

I’m trying to determine which kind of query and indexes would be beneficial to get exactly this kind of data out from this table?

2

Answers


  1. Chosen as BEST ANSWER

    Here's the query I ended up using here. Using the subqueries was something my colleague came up with:

    
    SELECT
      e.identifier,
      (
        SELECT e2.event_data->>'target'
        FROM events e2
        WHERE e2.identifier=e.identifier
        AND e2.event_data->>'target' IS NOT NULL
        ORDER by timestamp DESC LIMIT 1
      ) AS target,
      (
        SELECT e2.event_data->>'user'
        FROM events e2
        WHERE e2.identifier=e.identifier
        AND e2.event_data->>'user' IS NOT NULL
        ORDER by timestamp DESC LIMIT 1
      ) AS user,
      ARRAY(
        SELECT jsonb_array_elements(e2.event_data->'ingredients')
        FROM events e2
        WHERE e2.identifier=e.identifier
        AND e2.event_data->>'ingredients' IS NOT NULL
      ) AS ingredients,
      MAX(timestamp) AS latest_update
    FROM events e
    GROUP BY identifier;
    

  2. CREATE TABLE events (
        id SERIAL PRIMARY KEY,
        identifier VARCHAR(255),
        timestamp TIMESTAMP WITH TIME ZONE,
        event_data JSONB
    );
    INSERT INTO events (identifier, timestamp, event_data)
    VALUES
        ('12345', '2019-01-01T00:00:00.000Z', '{"target": "99999999"}'),
        ('12345', '2019-01-01T12:00:00.000Z', '{"ingredients": ["Banana", "Strawberry"]}'),
        ('12345', '2019-01-03T00:00:00.000Z', '{"target": "12345678", "us": "peterpan"}'),
        ('12345', '2019-01-04T00:00:00.000Z', '{"ingredients": ["Melon"], "us": "robinhood"}'),
        ('67890', '2019-01-03T00:00:00.000Z', '{"target": "0000", "us": "mickeymouse"}'),
        ('67890', '2019-01-04T00:00:00.000Z', '{"ingredients": ["Potato"]}');
    
    WITH events_flattened AS (
        SELECT
            identifier,
            timestamp,
            event_data->>'target' AS target,
            event_data->>'us' AS us,
            elem.part
        FROM events
        LEFT JOIN LATERAL jsonb_array_elements(event_data->'ingredients') elem(part) ON true
    ),
    
    latest_target AS (
        SELECT DISTINCT ON (identifier) 
            identifier, 
            target
        FROM events_flattened
        WHERE target IS NOT NULL
        ORDER BY identifier, timestamp DESC
    ),
    
    latest_user AS (
        SELECT DISTINCT ON (identifier) 
            identifier, 
            us
        FROM events_flattened
        WHERE us IS NOT NULL
        ORDER BY identifier, timestamp DESC
    )
    
    SELECT 
        e.identifier,
        t.target,
        u.us,
        ARRAY_REMOVE(ARRAY_AGG(e.part), NULL) AS ingredients,
        MAX(e.timestamp) AS latest_update
    FROM events_flattened e
    LEFT JOIN latest_target t ON e.identifier = t.identifier
    LEFT JOIN latest_user u ON e.identifier = u.identifier
    GROUP BY e.identifier, t.target, u.us;
    
    
    identifier target us ingredients latest_update
    12345 12345678 robinhood {""Banana"",""Strawberry"",""Melon""} 2019-01-04 00:00:00+00
    67890 0000 mickeymouse {""Potato""} 2019-01-04 00:00:00+00

    fiddle

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search