skip to Main Content

I’m doing a simple COPY command that used to work:

echo " COPY table_name
FROM 's3://bucket/<date>/'  
iam_role 'arn:aws:iam::123:role/copy-iam'
format as json 's3://bucket/jupath.json' 
gzip ACCEPTINVCHARS ' ' TRUNCATECOLUMNS TRIMBLANKS MAXERROR 3;
" | psql

And now I get:

INFO:  Load into table 'table_name' completed, 53465077 record(s) loaded successfully.
ERROR:  deadlock detected
DETAIL:  Process 26999 waits for AccessExclusiveLock on relation 3176337 of database 108036; blocked by process 26835.
Process 26835 waits for ShareLock on transaction 24230722; blocked by process 26999.

The only change is moving from dc2 instance type to ra3. Let me add this is the only command touches this table and there is only one process at a time.

2

Answers


  1. Chosen as BEST ANSWER

    Clearly a bug.

    Table was cloned from one redshift to another by doing SHOW TABLE table_name, which provided:

     CREATE TABLE table_name (        
         message character varying(50) ENCODE lzo,         
         version integer ENCODE az64,                      
         id character varying(100) ENCODE lzo ,
         access character varying(25) ENCODE lzo,          
         type character varying(25) ENCODE lzo,            
         product character varying(50) ENCODE lzo,      
     )                                                     
     DISTSTYLE AUTO SORTKEY AUTO ;
    

    After removing the "noise" the command completed as usual without errors:

     DROP TABLE table_name;
     CREATE TABLE table_name (        
         message character varying(50),         
         version integer,                      
         id character varying(100),
         access character varying(25),          
         type character varying(25),            
         product character varying(50),      
     );
    

  2. The key detail here is in the error message:

    Process 26999 waits for AccessExclusiveLock on relation 3176337 of
    database 108036; blocked by process 26835. Process 26835 waits for
    ShareLock on transaction 24230722; blocked by process 26999.

    Relation 3176337, I assume, is the table in question – the target of the COPY. This should be confirmed by running something like:

    select distinct(id) table_id
    ,trim(datname)   db_name
    ,trim(nspname)   schema_name
    ,trim(relname)   table_name
    from stv_tbl_perm
    join pg_class on pg_class.oid = stv_tbl_perm.id
    join pg_namespace on pg_namespace.oid = relnamespace
    join pg_database on pg_database.oid = stv_tbl_perm.db_id 
    ;
    

    I don’t expect any surprises here but it is good to check. If it is some different table (object) then this is important to know.

    Now for the meat. You have 2 processes listed in the error message – PID 26999 and PID 26835. A process is a unique connection to the database or a session. So these are identifying the 2 connections to the database that have gotten locked with each other. So a good next step is to see what each of these sessions (processes or PIDs) are doing. Like this:

    select xid, pid, starttime, max(datediff('sec',starttime,endtime)) as runtime, type, listagg(regexp_replace(text,'\\n*',' ')) WITHIN GROUP (ORDER BY sequence) || ';' as querytext
      from svl_statementtext
      where pid in (26999, 26835) 
      --where xid = 16627013
        and sequence < 320
        --and starttime > getdate() - interval '24 hours'
      group by starttime, 1, 2, "type" order by starttime, 1 asc, "type" desc ;
    

    The thing you might run into is that these logging table "recycle" every few days so the data from this exact failure might be lost.

    The next part of the error is about the open transaction that is preventing 26835 from moving forward. This transaction (identified by an XID, or transaction ID) is preventing 26835 progressing and is part of process 26999 but 26999 needs 26835 to complete some action before it a move – a deadlock. So seeing what is in this transaction will be helpful as well:

    select xid, pid, starttime, max(datediff('sec',starttime,endtime)) as runtime, type, listagg(regexp_replace(text,'\\n*',' ')) WITHIN GROUP (ORDER BY sequence) || ';' as querytext
      from svl_statementtext
      where xid = 24230722
        and sequence < 320
        --and starttime > getdate() - interval '24 hours'
      group by starttime, 1, 2, "type" order by starttime, 1 asc, "type" desc ;
    

    Again the data may have been lost due to time. I commented out the date range where clause of the last 2 queries to allow for looking back further in these tables. You should also be aware that PID and XID numbers are reused so check the date stamps on the results to be sure that that info from different sessions aren’t be combined. You may need a new where clause to focus in on just the event you care about.

    Now you should have all the info you need to see why this deadlock is happening. Use the timestamps of the statements to see the order in which statements are being issued by each session (process). Remember that every transaction ends with a COMMIT (or ROLLBACK) and this will change the XID of the following statements in the session. A simple fix might be issuing a COMMIT in the "26999" process flow to close that transaction and let the other process advance. However, you need to understand if such a commit will cause other issues.

    If you can find all this info and if you need any help reach out.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search