I would like to to shuffle some columns from a table in Postgres database. I have 2 millions rows. I need to update all not null values by another.
I need to keep the same dataset. It’s not possible to have the same value two times. It’s not possible to swap data with next values because, if I do the same process with another column I will keep the same link. It’s to anonymise my database. Just need to shuffle data and keep the dataset.
exemple (change firstname and lastname):
id | firstname | lastname |
---|---|---|
1 | albert | einsten |
2 | isaac | newton |
3 | curie | |
4 | alexandre | Graham Bell |
5 | thomas | Edison |
shuffle firstname column:
id | firstname | lastname |
---|---|---|
1 | isaac | Graham Bell |
2 | albert | Edison |
3 | einsten | |
4 | thomas | newton |
5 | alexandre | curie |
How to do this with a speedy process?
2
Answers
Given the updated requirement, you might be better off using a strategy like Erwin Brandstetter’s solution that can easily be applied to an arbitrary number of columns, however I will leave my original answer with an update for doing the second column.
Original answer (requirement to shuffle one column):
Given how general your requirements are about the order of the shuffle, I’m not sure how much this will help practically, but I think it answers your question:
The idea here is that the inner query gets the values shifted by one (ignoring nulls). The
COALESCE
is used since the first one doesn’t have a proceeding entry, so it falls back on someLAST_VALUE
logic to get the last value (i.e. it behaves as if the shift loops around).The surrounding
UPDATE
statement joins test to the subquery to actually update the data.You can see it working in this Fiddle.
Updated (requirement to also shuffle a second column):
Given the updated requirement to also shuffle a second field, you could apply different logic there:
This simply shuffles the
lastname
in the opposite direction, sofirstname
is grabbed from the previous non-null row (wrapping around) andlastname
is grabbed from the next non-null row (wrapping around). Both columns will be changed.Here is a fiddle of it working
This shuffles values in the column
firstname
in a perfectly random fashion:"Perfectly random" includes the possibility that some columns might retain their original values. (The more rows the smaller the chance.) This is actually best for anonymizing data. Then values are truly random. If we force a switch, readers will get the minimum information that a different value was associated with a given ID.
It also observes your surprising rule to only shuffle not null values.
Repeat for every column you need to shuffle.
Without excluding
NULL
values, this single query works more cheaply for any number of columns:If
id
is a gap-less sequence we can removet1
from the equation, and joint0.id
tot2.rn
. (Wouldn’t work while excluding NULL values.)fiddle