skip to Main Content

I would like to to shuffle some columns from a table in Postgres database. I have 2 millions rows. I need to update all not null values by another.

I need to keep the same dataset. It’s not possible to have the same value two times. It’s not possible to swap data with next values because, if I do the same process with another column I will keep the same link. It’s to anonymise my database. Just need to shuffle data and keep the dataset.

exemple (change firstname and lastname):

id firstname lastname
1 albert einsten
2 isaac newton
3 curie
4 alexandre Graham Bell
5 thomas Edison

shuffle firstname column:

id firstname lastname
1 isaac Graham Bell
2 albert Edison
3 einsten
4 thomas newton
5 alexandre curie

How to do this with a speedy process?

2

Answers


  1. Given the updated requirement, you might be better off using a strategy like Erwin Brandstetter’s solution that can easily be applied to an arbitrary number of columns, however I will leave my original answer with an update for doing the second column.

    Original answer (requirement to shuffle one column):

    Given how general your requirements are about the order of the shuffle, I’m not sure how much this will help practically, but I think it answers your question:

    update test
    SET firstname = t2.firstname
    FROM
    (
     SELECT id, COALESCE(LAG(firstname, 1) OVER (ORDER BY id RANGE UNBOUNDED PRECEDING), LAST_VALUE(firstname) OVER (ORDER BY id RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) firstname
     FROM test t2
     WHERE firstname IS NOT NULL
    ) t2
    WHERE test.id = t2.id
    

    The idea here is that the inner query gets the values shifted by one (ignoring nulls). The COALESCE is used since the first one doesn’t have a proceeding entry, so it falls back on some LAST_VALUE logic to get the last value (i.e. it behaves as if the shift loops around).

    The surrounding UPDATE statement joins test to the subquery to actually update the data.

    You can see it working in this Fiddle.

    Updated (requirement to also shuffle a second column):

    Given the updated requirement to also shuffle a second field, you could apply different logic there:

    update test
    SET firstname = t2.firstname,
      lastname = t2.lastname
    FROM
    (
     SELECT id, 
      COALESCE(LAG(firstname, 1) OVER (ORDER BY id), LAST_VALUE(firstname) OVER (ORDER BY id RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) firstname,
      COALESCE(LEAD(lastname, 1) OVER (ORDER BY id), FIRST_VALUE(lastname) OVER (ORDER BY id RANGE UNBOUNDED PRECEDING)) lastname
     FROM test t2
     WHERE firstname IS NOT NULL AND lastname IS NOT NULL
    ) t2
    WHERE test.id = t2.id
    

    This simply shuffles the lastname in the opposite direction, so firstname is grabbed from the previous non-null row (wrapping around) and lastname is grabbed from the next non-null row (wrapping around). Both columns will be changed.

    Here is a fiddle of it working

    Login or Signup to reply.
  2. This shuffles values in the column firstname in a perfectly random fashion:

    UPDATE test t0
    SET    firstname = t2.firstname
    FROM  (SELECT row_number() OVER (ORDER BY random()) AS rn, id        FROM test WHERE firstname IS NOT NULL) t1
    JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, firstname FROM test WHERE firstname IS NOT NULL) t2 USING (rn)
    WHERE  t0.id = t1.id
    AND    t0.firstname IS NOT NULL;
    

    "Perfectly random" includes the possibility that some columns might retain their original values. (The more rows the smaller the chance.) This is actually best for anonymizing data. Then values are truly random. If we force a switch, readers will get the minimum information that a different value was associated with a given ID.

    It also observes your surprising rule to only shuffle not null values.

    Repeat for every column you need to shuffle.

    Without excluding NULL values, this single query works more cheaply for any number of columns:

    UPDATE test t0
    SET    firstname = t2.firstname
         , lastname  = t3.lastname
    FROM  (SELECT row_number() OVER (ORDER BY random()) AS rn, id        FROM test) t1
    JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, firstname FROM test) t2 USING (rn)
    JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, lastname  FROM test) t3 USING (rn)
    WHERE  t0.id = t1.id;
    

    If id is a gap-less sequence we can remove t1 from the equation, and join t0.id to t2.rn. (Wouldn’t work while excluding NULL values.)

    fiddle

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search