
I currently have a database schema like the following table:

CREATE TABLE Measures(
    expId SERIAL,
    iteration INT NOT NULL,
    value float4 NOT NULL,
    PRIMARY KEY(expId, iteration)
);

So, a table of various measurements, repeated for n iterations.
However, because we have more data than originally expected, I want to move to a new table layout that uses an array column instead, which overall gives better performance (already tested and benchmarked):

CREATE TABLE TmpMeasures(
    expId SERIAL PRIMARY KEY,
    "values" float4[] NOT NULL  -- quoted because VALUES is a reserved word
);

My problem now is how to get the old data into the new format.
The data may look something like this. Note that iterations don't always produce all data, so there may be NULL values in the final array:

INSERT INTO Measures (expId, iteration, value)
VALUES
  (1, 1, 1.1), (1, 2, 1.2), (1, 3, 1.3),
  (2, 1, 2.1),                           (2, 4, 2.4),
  (3, 1, 3.1),                           (3, 4, 3.4);

The conversion could be done with a two-step process, roughly as follows, by first creating an empty array for each experiment and then populating the iteration values:

INSERT INTO TmpMeasures(expId, "values")
  SELECT expId, '{}'::float4[] 
  FROM Measures
  ON CONFLICT DO NOTHING;

UPDATE TmpMeasures tm
  SET "values"[iteration] = m.value
  FROM Measures m WHERE tm.expId = m.expId;

However, the UPDATE only ever seems to take the first iteration, i.e., iteration = 1, and I don't quite understand why that is the case.

I suspect alternative approaches to values[iteration] would group by expId, order by iteration, and aggregate the values into an array.
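
A naive version of that idea (just a sketch, not something I have working) would be the following, but it silently drops the missing iterations instead of leaving NULL gaps, so expId = 3 would end up as {3.1,3.4} rather than {3.1,NULL,NULL,3.4}:

INSERT INTO TmpMeasures(expId, "values")  -- assumes an empty TmpMeasures
SELECT expId, array_agg(value ORDER BY iteration)
FROM Measures
GROUP BY expId;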

Unfortunately, the data isn’t perfect, but iterations should line up.

So, the following seems to work, but it’s extremely slow, and I don’t quite understand why it’s needed in the first place.

DO
$do$
BEGIN 
   FOR i IN 1..(SELECT max(iteration) FROM Measures m) LOOP
      UPDATE TmpMeasures tm
      SET "values"[i] = m.value
      FROM Measures m
      WHERE
        tm.expId = m.expId AND
        m.iteration=i;
   END LOOP;
END
$do$;

Why does the "normal" update statement not suffice?
Answer: thanks to @Zegarek's pointer below, this is indeed the expected behavior of UPDATE, which applies only a single row from the FROM clause to each target row.

So, the more relevant question for me at this point is whether there's a better approach for the data conversion, one that produces, for instance, {3.1, NULL, NULL, 3.4} for expId = 3.

2 Answers


  1. Just select the values you want to be inserted:

    INSERT INTO TmpMeasures
    SELECT m0.expid, ARRAY[m1.value, m2.value, m3.value, m4.value] as a  -- m0.expid is non-NULL even if iteration 1 is missing
    FROM (SELECT distinct expid FROM Measures) m0
    LEFT JOIN Measures m1 ON m1.expid = m0.expid and m1.iteration = 1
    LEFT JOIN Measures m2 ON m2.expid = m0.expid and m2.iteration = 2
    LEFT JOIN Measures m3 ON m3.expid = m0.expid and m3.iteration = 3
    LEFT JOIN Measures m4 ON m4.expid = m0.expid and m4.iteration = 4
    ORDER BY expid
    

    Start by selecting the unique expids with SELECT DISTINCT expid FROM Measures, because not every expid has a value for every iteration.

    see: DBFIDDLE

    output:

    expid | values
    ------+--------------------
        1 | {1.1,1.2,1.3,NULL}
        2 | {2.1,NULL,NULL,2.4}
        3 | {3.1,NULL,NULL,3.4}

    EDIT: An alternative solution, because you might have more than 4 iterations.

    INSERT INTO TmpMeasures
    WITH nrs AS (
      SELECT *
      FROM generate_series(1, (SELECT max(iteration) FROM Measures)) AS f(n)
      CROSS JOIN (SELECT DISTINCT expid FROM Measures) AS e  -- the derived table needs an alias
    )
    SELECT nrs.expid, ARRAY_AGG(m.value ORDER BY nrs.n) AS a  -- ORDER BY keeps array positions aligned with iterations
    FROM nrs
    LEFT JOIN Measures m ON m.expid = nrs.expid AND m.iteration = nrs.n
    GROUP BY nrs.expid
    ORDER BY nrs.expid;
    

    see: DBFIDDLE

  2. Build the arrays with generate_series() and array_agg():
    1. Get min/max iterations to later generate_series() for everything in between.
    2. Grab distinct expIds to left join them with the iterations, so missing values come out as NULL.
    3. Then array_agg(value order by iteration).
    4. Instead of update, you can create table as and add the constraints later: demo
    create table if not exists TmpMeasures as
    with min_max_iteration as (
      select min(iteration), 
             max(iteration) from Measures)
    ,distinct_expIds as (
      select distinct expId from Measures)
    select expId,
           array_agg(value order by iteration) as "values"
    from min_max_iteration
    cross join distinct_expIds
    cross join generate_series(min,max) g(iteration)
    left join Measures using(iteration,ExpId)
    group by ExpId;
    
    expId | values
    ------+--------------------
        1 | {1.1,1.2,1.3,NULL}
        2 | {2.1,NULL,NULL,2.4}
        3 | {3.1,NULL,NULL,3.4}

    Adding constraints afterwards: (identity columns are encouraged over serial).

    select max(expId)+1 from TmpMeasures;--4
    
    alter table TmpMeasures 
       add primary key (expId)
      ,alter column expId add generated by default as identity(start with 4)
      ,alter column "values" set not null;
    

    The select is needed before that because you can’t run it as a scalar subselect in alter table. The demo also shows how to inspect each step, and how to run it as an update instead.
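
    Roughly, such an update could look like the sketch below (an illustration of the idea only, not the exact code from the demo; it assumes TmpMeasures already holds one row per expId, e.g. from the question's first INSERT step):

    update TmpMeasures tm
    set "values" = agg.vals
    from (
      -- build the complete array per expId first, so each target row
      -- matches exactly one source row
      select e.expId,
             array_agg(m.value order by g.iteration) as vals
      from (select min(iteration) as mn, max(iteration) as mx from Measures) b
      cross join (select distinct expId from Measures) e
      cross join generate_series(b.mn, b.mx) g(iteration)
      left join Measures m on m.expId = e.expId and m.iteration = g.iteration
      group by e.expId
    ) agg
    where tm.expId = agg.expId;

    Because the aggregation happens before the join to TmpMeasures, each target row matches exactly one source row, so the single-row limitation discussed below doesn't apply.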

    If some iterations are missing from every experiment in Measures and you want to skip that slot in all arrays, you can filter the generate_series() output with a where ... in (...) condition, as sketched below.
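
    For example, a sketch of the same query with such a filter added (assuming the intent is that only iteration numbers appearing somewhere in Measures get an array slot):

    create table if not exists TmpMeasures as
    with min_max_iteration as (
      select min(iteration),
             max(iteration) from Measures)
    ,distinct_expIds as (
      select distinct expId from Measures)
    select expId,
           array_agg(value order by iteration) as "values"
    from min_max_iteration
    cross join distinct_expIds
    cross join generate_series(min, max) g(iteration)
    left join Measures using(iteration, expId)
    -- keep only iteration numbers that occur in at least one experiment
    where iteration in (select distinct iteration from Measures)
    group by expId;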

    If you were looking for performance: on 28k samples (60 expIds with 800 iterations each, 40% chance to miss an iteration) this takes 0.25s, while the looped PL/pgSQL update needs 15.80s (performance demo). In v16 the difference grows to 0.08s versus the loop running out of resources and failing.


    The UPDATE t SET arr[m.k]=m.v FROM m WHERE m.id=t.id idea

    As mentioned in the comment, it's update's fault it won't work:

    A target row shouldn’t join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.

    Even though the subscript addresses a specific element in the array value, update will not use more than one of the rows matched from the source, and which one it uses is arbitrary: in effect, for each id, only one (key, value) pair arrives to be applied and the rest are discarded.

    Without order by, all things are unordered, and update doesn't even offer an order by. If multiple matched rows were applied, you'd see anomalies resulting from the unpredictable order in which those row updates hit the same target row. In your case each row update targets a different index in the array value, so they wouldn't interfere with each other, but that's not always the case.
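
    To see the multiplicity directly, you can count how many source rows compete for each target row (a diagnostic query only, not part of the migration); only one of them ever gets applied per row:

    -- each expId matches several Measures rows, but the joined UPDATE applies only one of them
    select expId, count(*) as competing_source_rows
    from Measures
    group by expId
    order by expId;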
