  1. Process 1: insert & update (with subquery)
  2. Process 2: select & update
  3. Process 3: pure select

The violation error occurs in two places (the table and processes are described below):

  • Sometimes process 1 fails; then, in order to fill in the data ASAP, multiple copies of it are run at the same time, and process 2 (update data) gets a serialization isolation violation with a 3-transaction cycle. This can be explained, since it is possible for the insert from process 1 to happen after the select and before the update in process 2.

  • Process 3 gets the same serialization isolation violation error.

Question

Why does a pure select get a serialization isolation error in process 3?
The error first appears in process 2; once ETL process 1 has finished, the error disappears in both process 2 and process 3.

Table

The table data has the columns:

id, time, company, info, count

The Process

There are 3 processes:

1: insert into data

We insert data into the table from the log data and update the count; the SQL looks like this:

insert into data (time, company, info, count) values ('2023-08-01 00:00:00', 'A', '', 1);
update data set count = count where id in (select id from data where 
 ...);

This is an ETL process written in Java, run once per hour; as far as I know, autocommit is True by default.
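
For illustration only, here is a minimal sketch of that two-statement flow in Python/psycopg2 (the real process is Java, and the connection details are placeholders). The point is that with autocommit enabled, each statement commits as its own transaction, so the insert and the count update are two separate transactions:

import psycopg2

# Illustrative sketch of process 1's flow; the real ETL is Java, but the
# transaction behaviour is the same: with autocommit on, each execute()
# commits as its own, separate transaction.
conn = psycopg2.connect("dbname=... host=... user=... password=...")  # placeholder connection details
conn.autocommit = True  # mirrors the ETL's autocommit=True default

cur = conn.cursor()

# Transaction A: the insert commits immediately.
cur.execute(
    "insert into data (time, company, info, count) "
    "values ('2023-08-01 00:00:00', 'A', '', 1)"
)

# Transaction B: the count update commits separately.
# The subquery body is elided in the question, so it is left elided here too.
cur.execute("update data set count = count where id in (select id from data where ...)")

cur.close()
conn.close()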

2: update data

We need to update the info field of the table, since the info is not available during the ETL process. The process looks like this:

select id, company from data where date(time) = '2023-08-01' and info = '';
-- do some calculation in Python to work out the new info value
update data set info = <calculated info> where id = <id>;
commit;

This process uses Python with psycopg2 >= 2.7; autocommit is left at its default value of False, and it runs once per day.
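
As a minimal sketch of this flow with psycopg2 (the connection details are placeholders and compute_info is a hypothetical stand-in for the Python-side calculation): with autocommit=False, the first execute() implicitly opens the transaction, and everything up to commit() runs inside it, so the gap between the select and the updates is exactly where an insert from process 1 can land:

import psycopg2

conn = psycopg2.connect("dbname=... host=... user=... password=...")  # placeholder; autocommit defaults to False

cur = conn.cursor()

# The first execute() implicitly begins the transaction.
cur.execute("select id, company from data where date(time) = '2023-08-01' and info = ''")
rows = cur.fetchall()

for row_id, company in rows:
    info = compute_info(company)  # hypothetical helper for the Python-side calculation
    cur.execute("update data set info = %s where id = %s", (info, row_id))

conn.commit()  # the whole select-then-update block commits as one transaction

cur.close()
conn.close()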

3: query

We need to get the data for an offline report. The SQL:

select * from data where date(time) = '2023-08-01';

select count(*) from data where date(time) = '2023-08-01';

select company, count(company) from data where date(time) = '2023-08-01' group by company;

This process also uses Python with psycopg2 >= 2.7; autocommit is False by default.

Code of Process 3


import psycopg2

conn = psycopg2.connect(config, connect_timeout=10)  # config holds the connection parameters

def query(sql):
    cur = conn.cursor()
    cur.execute(sql)
    try:
        while True:
            rows = cur.fetchmany(1000)
            if not rows:
                break
            for row in rows:
                yield row
    except: # noqa
        pass
    cur.close()

def process(rows):
    for row in rows:
        pass

process(query("select * from data where date(time) = '2023-08-01'"))
process(query("select count(*) from data where date(time) = '2023-08-01'"))

conn.close()

2 Answers


  1. It sounds like you have some understanding of what is going on and how MVCC works, so I’ll go quickly past this. However, you haven’t provided enough information to know exactly what is going on; but since your question is about ‘how it could happen’, a possible way is all you are looking for.

    These serialization errors happen when at least 2 transactions create conflicting commit dependencies that cannot be resolved. Redshift then needs to pick one (usually the one that commits first) to be committed and the other will be terminated and its updates lost. From your description it looks like you have debugged this enough to think that process 1 (containing 2 transactions) is the one succeeding and process 2 (1 transaction) is being lost. (Knowing exactly what is going on takes tracking the xids and oids involved and the exact timing of the statements in each transaction.) You also state that these are separate "processes" which means that there are 3 separate connections to Redshift each with a unique PID (process id aka session).

    Given this, you ask ‘why is the 3rd transaction failing?’. It is likely that the source information of transaction 3 has been lost due to the serialization error. But, you ask, ‘if transaction 3 is read-only and can only use previously committed information, how can this be?’. It should have no dependency on uncommitted results, which, I think, is the dilemma you are pointing to, no?

    The first possibility is that transaction 3 isn’t as "clean" as you think. There are several ways this can happen:

    1. Transaction 3 is in the same session as transaction #2. Are these truly unique sessions?
    2. There is some uncommitted information in transaction 3. Is there connection pooling in play? Does the python code run any SQL before this snippet that might not be committed? Is this code being run from Lambda? (See the sketch after this list.)
    3. Transaction 3 is part of the loop. What tables (objects) are referenced in transaction 3? Are they the tables you expect from the serialization error? Are there views involved that obscure the dependencies?
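
    Regarding point 2: with psycopg2 and autocommit left at False (as in process 3), the first execute() on a connection implicitly opens a transaction that stays open until commit() or rollback() is called. A minimal sketch, with a placeholder connection string, of how you could check whether the ‘pure select’ session is sitting in an open transaction:

    import psycopg2
    from psycopg2 import extensions

    conn = psycopg2.connect("dbname=... host=... user=... password=...")  # placeholder DSN

    cur = conn.cursor()
    cur.execute("select * from data where date(time) = '2023-08-01'")
    cur.fetchall()

    # Even though only a SELECT has run, the session is now "in transaction"
    # and stays that way until commit(), rollback(), or close().
    assert conn.get_transaction_status() == extensions.TRANSACTION_STATUS_INTRANS

    conn.rollback()  # either commit() or rollback() ends the read-only transaction
    assert conn.get_transaction_status() == extensions.TRANSACTION_STATUS_IDLE

    If the session reports being in a transaction for the whole report run, that is the kind of leftover transaction state point 2 is asking about.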

    To know for sure what is happening, you will need to list out all the statements for these 3 sessions. Include the statement text, pid, xid, and exact time. Highlight the xids and tables (oids) that are referenced in the error message. Note when each transaction references these tables and when they are committed (you may need to align the commit system table with this info). You are looking for when 2 transactions try to commit to the same table but used different source information for calculating their updates. This is a bit of a pain, and I expect you will be surprised that one of these transactions has a lot more going on than you expect.
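
    Only as a sketch (the exact system views and columns can vary by Redshift version, and the connection details and date filter are placeholders), something along these lines pulls the statement text with pid, xid, and timing, the serialization conflicts with the table oid, and the commit times, so you can line the three sessions up:

    import psycopg2

    conn = psycopg2.connect("dbname=... host=... user=... password=...")  # placeholder DSN
    cur = conn.cursor()

    # Statement text with xid/pid and timing for the day in question.
    cur.execute("""
        select xid, pid, starttime, endtime, sequence, trim(text) as text
        from svl_statementtext
        where starttime >= '2023-08-01'
        order by xid, starttime, sequence
    """)
    for row in cur.fetchall():
        print(row)

    # Transactions aborted by serialization conflicts, with the table oid involved.
    cur.execute("""
        select xact_id, process_id, xact_start_ts, abort_time, table_id
        from stl_tr_conflict
        order by abort_time desc
    """)
    for row in cur.fetchall():
        print(row)

    # Commit times, to align with the statement timeline.
    cur.execute("select xid, node, startwork, endtime from stl_commit_stats order by endtime desc")
    for row in cur.fetchall():
        print(row)

    cur.close()
    conn.close()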

    Once you understand the order of events I expect that the true reason for transaction 3 being terminated will be clear.

  2. MVCC is a long and complex subject, and cannot be explained in an on-line text-only exchange.

    I’ve published a PDF which explains MVCC, and MVCC in Redshift in particular;

    https://www.redshiftresearchproject.org/white_papers/downloads/multi_version_concurrency_control_and_serialization_isolation_failure.pdf

    You may also want to look at Combobulator, which will let you inspect your cluster and see what’s actually going on;

    https://www.redshiftresearchproject.org/combobulator/index.html
