
I currently have a database schema like the following table:

CREATE TABLE Measures(
    expId SERIAL,
    iteration INT NOT NULL,
    value float4 NOT NULL,
    PRIMARY KEY(expId, iteration)
);

So, a table of various measurements, repeated for n iterations.
However, because we have more data than originally expected, I want to move to a new table layout that uses an array column instead, which overall gives better performance (already tested and benchmarked):

CREATE TABLE TmpMeasures(
    expId SERIAL PRIMARY KEY,
    "values" float4[] NOT NULL  -- quoted because VALUES is a reserved word
);

My problem now is how to get the old data into the new format.
The data may look something like this. Note that iterations don't always produce all data, so there may be NULL values in the final array:

INSERT INTO Measures (expId, iteration, value)
VALUES
  (1, 1, 1.1), (1, 2, 1.2), (1, 3, 1.3),
  (2, 1, 2.1),                           (2, 4, 2.4),
  (3, 1, 3.1),                           (3, 4, 3.4);

The conversion could be done with a two-step process, roughly as follows, by first creating an empty array for each experiment and then populating the iteration values:

INSERT INTO TmpMeasures(expId, "values")
  SELECT expId, '{}'::float4[] 
  FROM Measures
  ON CONFLICT DO NOTHING;

UPDATE TmpMeasures tm
  SET "values"[iteration] = m.value
  FROM Measures m WHERE tm.expId = m.expId;

However, the UPDATE only ever seems to take the first iteration, i.e., iteration = 1, and I don't quite understand why that is the case.

I suspect alternative approaches to values[iteration] would group by expId, order by iteration, and aggregate the values into an array.
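
A naive version of that idea (just a sketch, not something I have working) would be the following, but it silently drops the missing iterations instead of leaving NULL gaps, so expId = 3 would end up as {3.1,3.4} rather than {3.1,NULL,NULL,3.4}:

INSERT INTO TmpMeasures(expId, "values")  -- assumes an empty TmpMeasures
SELECT expId, array_agg(value ORDER BY iteration)
FROM Measures
GROUP BY expId;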

Unfortunately, the data isn’t perfect, but iterations should line up.

So, the following seems to work, but it’s extremely slow, and I don’t quite understand why it’s needed in the first place.

DO
$do$
BEGIN 
   FOR i IN 1..(SELECT max(iteration) FROM Measures m) LOOP
      UPDATE TmpMeasures tm
      SET "values"[i] = m.value
      FROM Measures m
      WHERE
        tm.expId = m.expId AND
        m.iteration=i;
   END LOOP;
END
$do$;

Why does the "normal" update statement not suffice?
Answer: thanks to @Zegarek's pointer below, this is indeed the expected behavior of UPDATE, which applies only a single row from the FROM clause to each target row.

So, the more relevant question for me at this point is whether there's a better approach for the data conversion, one that produces, for instance, {3.1, NULL, NULL, 3.4} for expId = 3.

2 Answers


  1. Just select the values you want to be inserted:

    INSERT INTO TmpMeasures
    SELECT m0.expid, ARRAY[m1.value, m2.value, m3.value, m4.value] as a  -- m0.expid is non-NULL even if iteration 1 is missing
    FROM (SELECT distinct expid FROM Measures) m0
    LEFT JOIN Measures m1 ON m1.expid = m0.expid and m1.iteration = 1
    LEFT JOIN Measures m2 ON m2.expid = m0.expid and m2.iteration = 2
    LEFT JOIN Measures m3 ON m3.expid = m0.expid and m3.iteration = 3
    LEFT JOIN Measures m4 ON m4.expid = m0.expid and m4.iteration = 4
    ORDER BY expid
    

    Start by selecting the unique expids with SELECT DISTINCT expid FROM Measures, because not every expid has a value for every iteration.

    see: DBFIDDLE

    output:

    expid | values
    ------+--------------------
        1 | {1.1,1.2,1.3,NULL}
        2 | {2.1,NULL,NULL,2.4}
        3 | {3.1,NULL,NULL,3.4}

    EDIT: An alternative solution, because you might have more than 4 iterations.

    INSERT INTO TmpMeasures
    WITH nrs AS (
      SELECT *
      FROM generate_series(1, (SELECT max(iteration) FROM Measures)) AS f(n)
      CROSS JOIN (SELECT DISTINCT expid FROM Measures) AS e  -- the derived table needs an alias
    )
    SELECT nrs.expid, ARRAY_AGG(m.value ORDER BY nrs.n) AS a  -- ORDER BY keeps array positions aligned with iterations
    FROM nrs
    LEFT JOIN Measures m ON m.expid = nrs.expid AND m.iteration = nrs.n
    GROUP BY nrs.expid
    ORDER BY nrs.expid;
    

    see: DBFIDDLE

  2. Build the arrays with generate_series() and array_agg():
    1. Get min/max iterations to later generate_series() for everything in between.
    2. Grab distinct expIds to left join them with the iterations, so missing values come out as NULL.
    3. Then array_agg(value order by iteration).
    4. Instead of update, you can create table as and add the constraints later: demo
    create table if not exists TmpMeasures as
    with min_max_iteration as (
      select min(iteration), 
             max(iteration) from Measures)
    ,distinct_expIds as (
      select distinct expId from Measures)
    select expId,
           array_agg(value order by iteration) as "values"
    from min_max_iteration
    cross join distinct_expIds
    cross join generate_series(min,max) g(iteration)
    left join Measures using(iteration,ExpId)
    group by ExpId;
    
    expId | values
    ------+--------------------
        1 | {1.1,1.2,1.3,NULL}
        2 | {2.1,NULL,NULL,2.4}
        3 | {3.1,NULL,NULL,3.4}

    Adding constraints afterwards: (identity columns are encouraged over serial).

    select max(expId)+1 from TmpMeasures;--4
    
    alter table TmpMeasures 
       add primary key (expId)
      ,alter column expId add generated by default as identity(start with 4)
      ,alter column "values" set not null;
    

    The select is needed before that because you can’t run it as a scalar subselect in alter table. The demo also shows how to inspect each step, and how to run it as an update instead.
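
    Roughly, such an update could look like the sketch below (an illustration of the idea only, not the exact code from the demo; it assumes TmpMeasures already holds one row per expId, e.g. from the question's first INSERT step):

    update TmpMeasures tm
    set "values" = agg.vals
    from (
      -- build the complete array per expId first, so each target row
      -- matches exactly one source row
      select e.expId,
             array_agg(m.value order by g.iteration) as vals
      from (select min(iteration) as mn, max(iteration) as mx from Measures) b
      cross join (select distinct expId from Measures) e
      cross join generate_series(b.mn, b.mx) g(iteration)
      left join Measures m on m.expId = e.expId and m.iteration = g.iteration
      group by e.expId
    ) agg
    where tm.expId = agg.expId;

    Because the aggregation happens before the join to TmpMeasures, each target row matches exactly one source row, so the single-row limitation discussed below doesn't apply.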

    If some iterations are missing from every experiment in Measures and you want to skip that slot in all arrays, you can filter the generate_series() output with a where ... in (...) condition, as sketched below.
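
    For example, a sketch of the same query with such a filter added (assuming the intent is that only iteration numbers appearing somewhere in Measures get an array slot):

    create table if not exists TmpMeasures as
    with min_max_iteration as (
      select min(iteration),
             max(iteration) from Measures)
    ,distinct_expIds as (
      select distinct expId from Measures)
    select expId,
           array_agg(value order by iteration) as "values"
    from min_max_iteration
    cross join distinct_expIds
    cross join generate_series(min, max) g(iteration)
    left join Measures using(iteration, expId)
    -- keep only iteration numbers that occur in at least one experiment
    where iteration in (select distinct iteration from Measures)
    group by expId;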

    If you were looking for performance: on 28k samples (60 expIds with 800 iterations each, 40% chance to miss an iteration) this takes 0.25s, while the looped PL/pgSQL update needs 15.80s (performance demo). In v16 the difference grows to 0.08s versus the loop running out of resources and failing.


    The UPDATE t SET arr[m.k]=m.v FROM m WHERE m.id=t.id idea

    As mentioned in the comment, it's update's fault it won't work:

    A target row shouldn’t join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.

    Even though the subscript addresses a specific element in the array value, update will not use more than one of the rows matched from the source, and which one it uses is arbitrary: in effect, for each id, only one (key, value) pair arrives to be applied and the rest are discarded.

    Without order by, all things are unordered, and update doesn't even offer an order by. If multiple matched rows were applied, you'd see anomalies resulting from the unpredictable order in which those row updates hit the same target row. In your case each row update targets a different index in the array value, so they wouldn't interfere with each other, but that's not always the case.
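
    To see the multiplicity directly, you can count how many source rows compete for each target row (a diagnostic query only, not part of the migration); only one of them ever gets applied per row:

    -- each expId matches several Measures rows, but the joined UPDATE applies only one of them
    select expId, count(*) as competing_source_rows
    from Measures
    group by expId
    order by expId;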
