- process 1: insert & update(with subquery)
- process 2: select & update
- process 3: pure select
the violation error occurs in two places (table and process described below)
-
somethings the process 1 might fail, then in order to fill the data ASAP, there will be multiple processes run at one time, then the process 2 (update data) will get
serialization isolation violation
with 3 transaction cycle, this can be explained, since this is possible: if the insert from process 1 happened after selection, and before update in process 2. -
the process 3 gets the same error
serialization isolation violation
Question
Why pure select gets isolation error in process 3?
The error begins in process 2, when the ETL process 1 finished, the error will disappear in both process 2 and process3.
Table
table data
:
id,time,company,info,count
The Process
there are 3 process:
1: insert into data
we will insert the data into table from log data, update the count, the sql looks like this:
insert into data (time, company, info, count) values ('2023-08-01 00:00:00', 'A', '', 1);
update data set count = count where id in (select id from data where
...);
this process is a ETL process written in java, run once per hour, as far as i know, autocommit
is default True
.
2: update data
we need to update the info
field of the table, since the info
are not present in ETL process, the process looks like this:
select id, company from data where date(time) = '2023-08-01' and info = '';
// do some calculation in python
update info = info where id = id;
commit;
this process using python with psycopg2 >= 2.7, besides autocommit
is default value of False
, run once per day.
3: query
we need to get the data for offline report, the sql:
select * from data where date(time) = '2023-08-01';
select count(*) from data where date(time) = '2023-08-01';
select company, count(company) from data where date(time) = '2023-08-01' group by company;
and this process using python with psycopg2 >= 2.7, autocommit
is False
by default.
Code of Process 3
conn = psycopg2.connect(config, connect_timeout=10)
def query(sql):
cur = conn.cursor()
cur.execute(sql)
try:
while True:
rows = cur.fetchmany(1000)
if not rows:
break
for row in rows:
yield row
except: # noqa
pass
cur.close()
def process(rows):
for row in rows:
pass
process(query("select * from data where date(time) = '2023-08-01'"))
process(query("select count(*) from data where date(time) = '2023-08-01'"))
conn.close()
2
Answers
It sounds like you have some understanding of what is going on and how MVCC works so I’ll go quickly past this. However, you haven’t provided enough information to know exactly what is going on but your question is about ‘how it could happen’ so a possible way is all you are looking for.
These serialization errors happen when at least 2 transactions create conflicting commit dependencies that cannot be resolved. Redshift then needs to pick one (usually the one that commits first) to be committed and the other will be terminated and its updates lost. From your description it looks like you have debugged this enough to think that process 1 (containing 2 transactions) is the one succeeding and process 2 (1 transaction) is being lost. (Knowing exactly what is going on takes tracking the xids and oids involved and the exact timing of the statements in each transaction.) You also state that these are separate "processes" which means that there are 3 separate connections to Redshift each with a unique PID (process id aka session).
Given this you ask ‘why is the 3rd transaction failing?’. It is likely that that source information of transaction 3 has been lost due to the serialization error. But, you ask, ‘if transaction 3 is read-only can can only use previously committed information how can this be?’. It should have no dependency on uncommitted results which, I think, is the dilemma you are pointing to, no?
The first possibility is that transaction 3 isn’t as "clean" as you think. There are several ways this can happen:
connection pooling in play? Does the python code run any SQL before
this snippet that might not be committed? Is this code being run
from Lambda?
To know for sure what is happening you will need to list out all the statements for these 3 sessions. Include the statement text, pid, xid, and exact time. Highlight the xids and tables (oid) that are referenced in the error message. Note when each transaction references these tables and when they are committed (you may need to align the commit system table with this info. You are looking for when 2 transactions are trying to commit the same table but used different source information for calculating their update information. This is a bit of a pain and I expect you will be surprised that one of these transactions has a lot more going on than you expect.
Once you understand the order of events I expect that the true reason for transaction 3 being terminated will be clear.
MVCC is a long and complex subject, and cannot be explained in an on-line text-only exchange.
I’ve published a PDF which explains MVCC, and in RS in particular;
https://www.redshiftresearchproject.org/white_papers/downloads/multi_version_concurrency_control_and_serialization_isolation_failure.pdf
You may also want to look at Combobulator, which will let you inspect your cluster and see what’s actually going on;
https://www.redshiftresearchproject.org/combobulator/index.html