Postgresql - How to shuffle column values while keeping same dataset

user3659832
October 14, 2022
283 views
0 votes
2 Answers

I would like to to shuffle some columns from a table in Postgres database. I have 2 millions rows. I need to update all not null values by another.

I need to keep the same dataset. It’s not possible to have the same value two times. It’s not possible to swap data with next values because, if I do the same process with another column I will keep the same link. It’s to anonymise my database. Just need to shuffle data and keep the dataset.

exemple (change firstname and lastname):

id	firstname	lastname
1	albert	einsten
2	isaac	newton
3		curie
4	alexandre	Graham Bell
5	thomas	Edison

shuffle firstname column:

id	firstname	lastname
1	isaac	Graham Bell
2	albert	Edison
3		einsten
4	thomas	newton
5	alexandre	curie

How to do this with a speedy process?

Answers

- EdmCoff
- October 14, 2022 at 12:53 am
- 0 votes
0
Given the updated requirement, you might be better off using a strategy like Erwin Brandstetter’s solution that can easily be applied to an arbitrary number of columns, however I will leave my original answer with an update for doing the second column.

Original answer (requirement to shuffle one column):

Given how general your requirements are about the order of the shuffle, I’m not sure how much this will help practically, but I think it answers your question:
```
update test
SET firstname = t2.firstname
FROM
(
 SELECT id, COALESCE(LAG(firstname, 1) OVER (ORDER BY id RANGE UNBOUNDED PRECEDING), LAST_VALUE(firstname) OVER (ORDER BY id RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) firstname
 FROM test t2
 WHERE firstname IS NOT NULL
) t2
WHERE test.id = t2.id
```
The idea here is that the inner query gets the values shifted by one (ignoring nulls). The COALESCE is used since the first one doesn’t have a proceeding entry, so it falls back on some LAST_VALUE logic to get the last value (i.e. it behaves as if the shift loops around).

The surrounding UPDATE statement joins test to the subquery to actually update the data.

You can see it working in this Fiddle.

Updated (requirement to also shuffle a second column):

Given the updated requirement to also shuffle a second field, you could apply different logic there:
```
update test
SET firstname = t2.firstname,
  lastname = t2.lastname
FROM
(
 SELECT id, 
  COALESCE(LAG(firstname, 1) OVER (ORDER BY id), LAST_VALUE(firstname) OVER (ORDER BY id RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) firstname,
  COALESCE(LEAD(lastname, 1) OVER (ORDER BY id), FIRST_VALUE(lastname) OVER (ORDER BY id RANGE UNBOUNDED PRECEDING)) lastname
 FROM test t2
 WHERE firstname IS NOT NULL AND lastname IS NOT NULL
) t2
WHERE test.id = t2.id
```
This simply shuffles the lastname in the opposite direction, so firstname is grabbed from the previous non-null row (wrapping around) and lastname is grabbed from the next non-null row (wrapping around). Both columns will be changed.

Here is a fiddle of it working
Login or Signup to reply.

- ErwinBrandstetter
- October 14, 2022 at 4:25 am
- 0 votes
0
This shuffles values in the column firstname in a perfectly random fashion:
```
UPDATE test t0
SET    firstname = t2.firstname
FROM  (SELECT row_number() OVER (ORDER BY random()) AS rn, id        FROM test WHERE firstname IS NOT NULL) t1
JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, firstname FROM test WHERE firstname IS NOT NULL) t2 USING (rn)
WHERE  t0.id = t1.id
AND    t0.firstname IS NOT NULL;
```
"Perfectly random" includes the possibility that some columns might retain their original values. (The more rows the smaller the chance.) This is actually best for anonymizing data. Then values are truly random. If we force a switch, readers will get the minimum information that a different value was associated with a given ID.

It also observes your surprising rule to only shuffle not null values.

Repeat for every column you need to shuffle.

Without excluding NULL values, this single query works more cheaply for any number of columns:
```
UPDATE test t0
SET    firstname = t2.firstname
     , lastname  = t3.lastname
FROM  (SELECT row_number() OVER (ORDER BY random()) AS rn, id        FROM test) t1
JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, firstname FROM test) t2 USING (rn)
JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, lastname  FROM test) t3 USING (rn)
WHERE  t0.id = t1.id;
```
If id is a gap-less sequence we can remove t1 from the equation, and join t0.id to t2.rn. (Wouldn’t work while excluding NULL values.)

fiddle
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – How to shuffle column values while keeping same dataset

Answers