
Simplified scenario:

I want to update one table A (10 million rows) with a value from another table B (20 million rows). The two tables are linked by ids.

It took more than 7 hours to update the whole thing in one go. (I don’t know exactly how long, because I stopped the script partway through.)
So my idea is to update table A in batches using OFFSET and LIMIT clauses. So far with no luck.

Wrapped in a procedure, the code looks like this:

DECLARE 
    offset_number integer := 0;
    batch_size integer := 1000; 

BEGIN
    LOOP 
        UPDATE TableA temp1
        SET TableA_column_value_to_be_updated = (
            SELECT 
                tableB_column_value
            FROM 
                TableB temp2
            WHERE 
                temp2.id = temp1.id 
                AND some_other_conditions_in_TableB    
            ) 
        WHERE    
             some_other_conditions_in_Table
        OFFSET offset_number 
        LIMIT batch_size ; 
            
        COMMIT; 
    
        offset_number := offset_number + batch_size;
        EXIT WHEN NOT FOUND; 
    END LOOP;
END;

The engine reports an error with this exception:

org.jkiss.dbeaver.model.sql.DBSQLException:
SQL Error [42601]: ERROR: syntax error at or near "OFFSET"

I have no idea what it means. Notably, the statement seems to work without OFFSET and LIMIT.

Any ideas why this happens? Should I use a different loop statement?

2 Answers


  1. LIMIT and OFFSET are not part of the syntax of an SQL UPDATE statement; you need a SELECT for that.
    Also, OFFSET scales poorly for "paginating" a big table. Remember the upper bound from the last iteration instead.

    Something like this could work:

    CREATE OR REPLACE PROCEDURE upd_in_batches(_batch_size int = 1000)
      LANGUAGE plpgsql AS
    $proc$
    DECLARE
       _id_bound int = 0;  -- or whatever?
    BEGIN
       LOOP
          WITH sel AS (
             SELECT a.id  -- id = PK!
             FROM   tablea a
             WHERE  a.id > _id_bound
          -- AND    <some other conditions in Table A>
             ORDER  BY a.id
             LIMIT  _batch_size
             FOR    UPDATE
             )
          , upd AS (
             UPDATE tablea a
             SET    target_col = b.b_source_col
             FROM   sel s
             JOIN   tableb b USING (id)
             WHERE  a.id = s.id
             AND    a.target_col IS DISTINCT FROM b.b_source_col
             )
          SELECT max(id)  -- always returns a row
          FROM   sel
          INTO   _id_bound;
    
          IF _id_bound IS NULL THEN
              EXIT;  -- no more rows found; we're done, exit loop
          ELSE
             COMMIT;
          END IF;
       END LOOP;
    END
    $proc$;
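
    To run it, call the procedure from the top level; a minimal usage sketch (the argument is optional, since _batch_size defaults to 1000 as declared above):

    CALL upd_in_batches(5000);

    Note that COMMIT inside a procedure only works if the CALL itself is not wrapped in an outer transaction block.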
    

    Use a SELECT statement instead to apply your LIMIT. To avoid race conditions with concurrent writes, throw in a locking clause (FOR UPDATE). You may or may not need that.

    You might be able to UPDATE directly and just increment the lower and upper bounds of the filter on id instead, which is cheaper; a sketch of that variant follows below. It depends on the details of your setup and requirements. Each approach has its caveats.
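
    For illustration, a minimal sketch of that direct range variant, assuming id is an integer primary key without huge gaps (the procedure name upd_by_id_range and the variables _lo and _max_id are made up here; target_col / b_source_col are the placeholder columns from above):

    CREATE OR REPLACE PROCEDURE upd_by_id_range(_batch_size int = 10000)
      LANGUAGE plpgsql AS
    $proc$
    DECLARE
       _lo     int := 0;   -- lower bound of the current id range
       _max_id int;        -- highest id present in tablea
    BEGIN
       SELECT max(id) INTO _max_id FROM tablea;

       WHILE _lo < _max_id LOOP
          UPDATE tablea a
          SET    target_col = b.b_source_col
          FROM   tableb b
          WHERE  a.id = b.id
          AND    a.id >  _lo
          AND    a.id <= _lo + _batch_size
          AND    a.target_col IS DISTINCT FROM b.b_source_col;
          -- add the other conditions on tablea / tableb here as needed

          _lo := _lo + _batch_size;
          COMMIT;
       END LOOP;
    END
    $proc$;

    With big gaps in the id sequence, some of these batches touch few or no rows, so a fixed id range does not translate into a fixed number of updated rows per transaction.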


  2. The issue is the use of OFFSET and LIMIT inside the UPDATE statement. PostgreSQL, like most SQL dialects, does not support OFFSET and LIMIT in an UPDATE query; they are only valid in SELECT statements.

    Here’s an alternative solution:

    DECLARE 
        offset_number INTEGER := 0;
        batch_size INTEGER := 1000; 
    
    BEGIN
        LOOP 
            -- Update rows in batches using a subquery to limit the rows processed
            WITH cte AS (
                SELECT temp1.id
                FROM TableA temp1
                WHERE some_other_conditions_in_Table
                ORDER BY temp1.id
                OFFSET offset_number
                LIMIT batch_size
            )
            UPDATE TableA temp1
            SET TableA_column_value_to_be_updated = (
                SELECT tableB_column_value
                FROM TableB temp2
                WHERE temp2.id = temp1.id AND some_other_conditions_in_TableB
            )
            WHERE temp1.id IN (SELECT id FROM cte);
    
            -- Commit after each batch
            COMMIT;
    
            -- Exit loop if no more rows are returned
            EXIT WHEN NOT FOUND;
    
            -- Increment the offset for the next batch
            offset_number := offset_number + batch_size;
        END LOOP;
    END;
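
    As in the question, this block still needs to be wrapped in a procedure or a DO block to run; for example (a sketch; COMMIT inside DO requires PostgreSQL 11+ and the block must not run inside an outer transaction):

    DO
    $do$
    DECLARE
        offset_number INTEGER := 0;
        batch_size INTEGER := 1000;
    BEGIN
        -- paste the LOOP ... END LOOP body from above here
        NULL;
    END
    $do$;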