I currently have a database schema with the following table:
CREATE TABLE Measures(
    expId     SERIAL,
    iteration INT    NOT NULL,
    value     float4 NOT NULL,
    PRIMARY KEY(expId, iteration)
);
So, a table of various measurements, repeated for n iterations.
However, because we have more data than originally expected, I want to move to a new table layout that uses an array column instead, which overall gives better performance (already tested and benchmarked):
CREATE TABLE TmpMeasures(
    expId    SERIAL PRIMARY KEY,
    "values" float4[] NOT NULL
);
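For reference, reading a single measurement back from this layout is plain array subscripting. A quick sketch (note that "values" has to be quoted here and throughout, since VALUES is a reserved word in PostgreSQL):

-- fetch iteration 2 of experiment 1; returns 1.2 once the sample data below is converted
SELECT "values"[2] AS value
FROM TmpMeasures
WHERE expId = 1;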
My problem now is how to get the old data into the new format.
The data may look something like this. Note that iterations don't always produce all data, so there may be NULL values in the final array:
INSERT INTO Measures (expId, iteration, value)
VALUES
(1, 1, 1.1), (1, 2, 1.2), (1, 3, 1.3),
(2, 1, 2.1), (2, 4, 2.4),
(3, 1, 3.1), (3, 4, 3.4);
The conversion could be done with a two-step process, roughly like this: first create an empty array for each experiment, then populate the iteration values:
INSERT INTO TmpMeasures(expId, "values")
SELECT expId, '{}'::float4[]
FROM Measures
ON CONFLICT DO NOTHING;
UPDATE TmpMeasures tm
SET "values"[m.iteration] = m.value
FROM Measures m
WHERE tm.expId = m.expId;
My problem now is that the UPDATE only ever seems to apply the first iteration, i.e., iteration = 1, and I am not quite understanding why that is the case.
I suspect alternative approaches to the "values"[iteration] assignment would group by expId, order by iteration, and aggregate the values into an array. Unfortunately, the data isn't perfect, but iterations should line up.
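For illustration, that aggregation alone would look like the following sketch. Note that it compacts the array: missing iterations simply disappear instead of becoming NULLs, so expId 3 would yield {3.1,3.4} rather than {3.1,NULL,NULL,3.4}.

SELECT expId, array_agg(value ORDER BY iteration) AS "values"
FROM Measures
GROUP BY expId;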
So, the following seems to work, but it’s extremely slow, and I don’t quite understand why it’s needed in the first place.
DO
$do$
BEGIN
    FOR i IN 1..(SELECT max(iteration) FROM Measures) LOOP
        UPDATE TmpMeasures tm
        SET "values"[i] = m.value
        FROM Measures m
        WHERE tm.expId = m.expId
          AND m.iteration = i;
    END LOOP;
END
$do$;
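To sanity-check any of these conversions, the arrays can be unnested back into rows and compared against the original table; a sketch using unnest ... WITH ORDINALITY:

-- rows present in the arrays but not in Measures; should return nothing
SELECT tm.expId, u.i AS iteration, u.v AS value
FROM TmpMeasures tm,
     unnest(tm."values") WITH ORDINALITY AS u(v, i)
WHERE u.v IS NOT NULL
EXCEPT
SELECT expId, iteration, value
FROM Measures;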
Why does the "normal" update statement not suffice?
Answer: thanks to @Zegarek's pointer below, this is indeed expected behavior of UPDATE ... FROM: for each target row, only a single matching row from the FROM clause is applied.
So, the more relevant question for me at this point is whether there's a better approach for the data conversion, one that produces, for instance, [3.1, NULL, NULL, 3.4] for expId = 3.
2 Answers
Just select the values you want to be inserted: start by selecting the unique expIds, using SELECT distinct expid FROM Measures, because not every iteration has all values. see: DBFIDDLE
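The fiddle's query isn't preserved here, but a hand-rolled version of that idea, hardcoding the four iterations seen in the sample data (a sketch, not necessarily the fiddle's exact code), could look like this:

INSERT INTO TmpMeasures(expId, "values")
SELECT DISTINCT m1.expId, ARRAY[
    (SELECT value FROM Measures m2 WHERE m2.expId = m1.expId AND m2.iteration = 1),
    (SELECT value FROM Measures m2 WHERE m2.expId = m1.expId AND m2.iteration = 2),
    (SELECT value FROM Measures m2 WHERE m2.expId = m1.expId AND m2.iteration = 3),
    (SELECT value FROM Measures m2 WHERE m2.expId = m1.expId AND m2.iteration = 4)
]
FROM Measures m1;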
EDIT: An alternative solution, because you might have more than 4 iterations. see: DBFIDDLE
The building blocks: max(iteration) to later generate_series() for everything in between; distinct expId to left join it with the iterations to get null for missing values; and array_agg(value order by i) to assemble the arrays.
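Assembled into one statement, that approach might look like this sketch (the linked fiddle isn't reproduced here):

INSERT INTO TmpMeasures(expId, "values")
SELECT e.expId, array_agg(m.value ORDER BY g.i)
FROM (SELECT DISTINCT expId FROM Measures) e
CROSS JOIN generate_series(1, (SELECT max(iteration) FROM Measures)) g(i)
LEFT JOIN Measures m ON m.expId = e.expId AND m.iteration = g.i
GROUP BY e.expId;

For expId = 3 this yields {3.1,NULL,NULL,3.4}, as requested above.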
Instead of insert and update, you can create table as and add the constraints later: demo. Adding constraints afterwards is sketched below (identity columns are encouraged over serial). The select is needed before that because you can't run it as a scalar subselect in alter table. The demo also shows how to inspect each step, and how to run it as an update instead.
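A sketch of that route (the linked demo isn't reproduced here, so details such as the exact constraint set are assumptions):

-- build the table in one go; assumes TmpMeasures does not already exist
CREATE TABLE TmpMeasures AS
SELECT e.expId, array_agg(m.value ORDER BY g.i) AS "values"  -- same select as above
FROM (SELECT DISTINCT expId FROM Measures) e
CROSS JOIN generate_series(1, (SELECT max(iteration) FROM Measures)) g(i)
LEFT JOIN Measures m ON m.expId = e.expId AND m.iteration = g.i
GROUP BY e.expId;

-- add the constraints afterwards
ALTER TABLE TmpMeasures
    ALTER COLUMN expId SET NOT NULL,
    ALTER COLUMN "values" SET NOT NULL,
    ADD PRIMARY KEY (expId);
ALTER TABLE TmpMeasures
    ALTER COLUMN expId ADD GENERATED BY DEFAULT AS IDENTITY;
-- in a real migration the identity sequence would still need to be advanced past max(expId)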
If you have some iterations that are missing from all current Measures rows and you want to skip that field in all arrays, you can use where i in (...) to filter the generate_series() g(i).
If you were looking for performance: on 28k samples (60 expIds with 800 iterations each, and a 40% chance to miss an iteration) this takes 0.25s: performance demo. The looped PL/pgSQL update needs 15.80s. In v16, the difference increases to 0.08s vs running out of resources and failing.

The UPDATE t SET arr[m.k]=m.v FROM m WHERE m.id=t.id idea

As mentioned in the comment, it's update's fault it won't work: even though the subscript addresses a specific element in the array value, update will not attempt to use more than one undetermined row matched from the source. In effect, for each id, only one (key, value) pair arrives to be applied; the rest is discarded.

Without order by, everything is unordered, and update doesn't even offer an order by. If this worked, you'd see anomalies resulting from the unpredictable order in which the multiple row updates would be applied to the same row. In your case each row update targets a different index in the array value, so they wouldn't interfere with each other and that wouldn't matter, but that's not always the case.
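That behavior is easy to reproduce in isolation; a minimal sketch (which of the three candidate rows gets applied is not predictable, per the PostgreSQL UPDATE documentation):

CREATE TEMP TABLE t(id int PRIMARY KEY, arr float4[] DEFAULT '{}');
CREATE TEMP TABLE m(id int, k int, v float4);
INSERT INTO t VALUES (1);
INSERT INTO m VALUES (1, 1, 0.1), (1, 2, 0.2), (1, 3, 0.3);

UPDATE t SET arr[m.k] = m.v FROM m WHERE m.id = t.id;

SELECT arr FROM t;  -- a single element, e.g. {0.1}; the other two pairs were discarded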