
I need to update a column in a PostGIS database.

I want to do that with this:

UPDATE 
  cj_geometry 
SET 
  groundgeometry = CASE WHEN NOT ST_IsValid(groundgeometry) THEN ST_Multi(
    ST_CollectionExtract(
      ST_MakeValid(groundgeometry), 
      3
    )
  ) ELSE ST_MakeValid(groundgeometry) END

The table has around 65,000,000 rows.

I let it run for an hour or so, then realized that there was no space left on the device, so I cancelled the query and restarted the database. It did restart, but it is now much bigger, with no new data. So two questions:

  1. How do I get it back to its original size?
  2. Why does this use so much disk space? How can I avoid this?

2 Answers


  1. To shrink the table, end all long-running transactions, close all prepared transactions, drop all stale replication slots (diagnostic queries for finding these are sketched below), and run

    VACUUM (FULL) table_name;
    

    Likely causes for the problem are:

    1. you ran all the updates in a single transaction or there was a concurrent long-running transaction

    2. you ran the updates so quickly that autovacuum couldn’t keep up
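
    To check for the blockers mentioned above (long-running transactions, prepared transactions, stale replication slots), you can query the standard system views. A minimal sketch; adjust the age threshold to taste:

    -- transactions that have been open for more than 5 minutes
    SELECT pid, state, xact_start, query
    FROM pg_stat_activity
    WHERE xact_start < now() - interval '5 minutes';

    -- prepared transactions left behind by two-phase commit
    SELECT gid, prepared, owner FROM pg_prepared_xacts;

    -- replication slots, stale or not
    SELECT slot_name, active, restart_lsn FROM pg_replication_slots;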

  2. Postgres does not release storage back to the OS by default. Databases tend only to grow, so there’s not much call to release storage back. Postgres keeps this storage around so that future inserts and updates can simply reuse the space occupied by dead tuples. This is faster than asking the OS to extend the size of the file.
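
    You can see how much of the table currently consists of dead tuples via the statistics views. A quick check, using the table name from the question:

    SELECT n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname = 'cj_geometry';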

    Reclaiming Disk Space

    To reclaim assigned storage you can do a full vacuum. This essentially rewrites the table into a new file, compacting the backing file to its smallest possible size. It requires enough free space to hold the old backing file and the new backing file simultaneously, so if you are very low on disk space this may not be an option.

    VACUUM (FULL) mytable;
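
    To gauge how much free space a full vacuum will need, check the table’s current on-disk size first (indexes and TOAST data included); the table name here is taken from the question:

    SELECT pg_size_pretty(pg_total_relation_size('cj_geometry'));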
    

    If you cannot do a full vacuum because of limited disk space, then your only option is to dump/export the table to a different host or partition, truncate or drop the table, and load/import it again. This may require you to drop any foreign key constraints that point to your table, so to keep database integrity you must stop and prevent any other users from accessing the database whilst you are performing this process.
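
    A minimal sketch of that route, assuming a spare partition is mounted at /mnt/backup (a hypothetical path) and using psql’s \copy:

    \copy cj_geometry TO '/mnt/backup/cj_geometry.csv' WITH (FORMAT csv)
    TRUNCATE cj_geometry;   -- releases the table's storage immediately
    \copy cj_geometry FROM '/mnt/backup/cj_geometry.csv' WITH (FORMAT csv)
    -- recreate or re-validate any foreign key constraints you had to drop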

    How to Avoid Using Too Much Disk Space

    To avoid this problem in future you can do incremental updates. This reduces the number of row versions that must be kept at any given time, and so reduces the maximum size of the backing file. When you update every row in a table in a single transaction, postgres must keep both the before and after version of every row until the transaction commits, which means the backing file has to grow to roughly twice its normal size to hold all these row versions.

    If you only update 1 million rows or so at a time, then the backing file only needs to be large enough to accommodate about 66 million tuples (65m live + 1m old versions). When you update the next batch, the 1 million old versions from the previous batch are no longer reachable, and postgres can safely reuse that space. You will likely have to tell postgres to recycle these unreachable tuples though, by running VACUUM between batches. For this technique to work, there must be no long-running transactions that keep the old rows reachable.

    BEGIN;
    -- Postgres has no LIMIT clause on UPDATE, so <criteria> itself must bound
    -- the batch, e.g. a range on the primary key as shown below.
    UPDATE mytable SET mycol = newval WHERE <criteria>;
    COMMIT;
    VACUUM mytable;  -- recycles unreachable tuples, but retains file storage
    

    <criteria> should be able to exclude previously updated rows. A simple criterion is a range on a unique column or primary key of the table, e.g.

    WHERE id >= 0 AND id < 1000000
    -- and then on the next update
    WHERE id >= 1000000 AND id < 2000000
    

    And then shift the minimum and maximum id values as you run each successive batch.

    If there is no unique identifier on the table, then you can instead test a column to see whether it still needs updating, and skip any rows that do not. This technique is more foolproof, but can take much longer overall, because the database has to re-check every row to see whether it needs updating. Since UPDATE has no LIMIT clause, the batch is bounded via a subquery over ctid, e.g.

    UPDATE mytable
    SET row_to_update = f(x)
    WHERE ctid IN (
      SELECT ctid FROM mytable
      WHERE row_to_update IS NULL
      LIMIT 1000000
    );
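
    Applied to the table from the question, a batched version might look like the sketch below. It assumes an integer primary key column named id (not shown in the question) and only rewrites rows that are actually invalid, since ST_MakeValid leaves valid geometries unchanged, which also keeps the number of dead tuples down:

    -- 'id' is assumed to be an integer primary key; shift the range on each run
    UPDATE cj_geometry
    SET groundgeometry = ST_Multi(
          ST_CollectionExtract(ST_MakeValid(groundgeometry), 3)
        )
    WHERE NOT ST_IsValid(groundgeometry)
      AND id >= 0 AND id < 1000000;

    VACUUM cj_geometry;  -- recycle the old row versions before the next batch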
    