I have a table that runs some heavier computation (roughly 5 minutes of processing per key). I want to reserve jobs and run it on multiple machines. I noticed that the other machines get locked out of the table as soon as one machine starts processing a job: they effectively have to wait until one of the jobs finishes before they can start their own, or get a chance to grab a job. Where does this behavior stem from? When a job takes too long, I run into "Lock wait timeout exceeded" errors on the machines other than the one that is currently processing the job.
@schema
class HeavyComputation(dj.Computed):
    definition = """
    # ...
    -> Table1
    class_label : varchar(25)
    -> Table2.proj(somekey2="somekey")
    ---
    analyzed : longblob
    """
I am running .populate() on the table with the following settings:

settings = {"display_progress": True,
            "reserve_jobs": True,
            "suppress_errors": True,
            "order": "random"}
HeavyComputation.populate(**settings)
Answers
The problem turned out to be a .delete() call inside a subfunction of my make function. I keep track of temporary files in another (unrelated) table and wanted them cleaned up once the make routine finishes. However, this .delete() was running into a table lock and thereby prevented the .populate() call from finishing.
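A minimal sketch of the offending pattern; the schema name, the TempFiles bookkeeping table, and the run_analysis helper are placeholder names, not my actual code:

import datajoint as dj

schema = dj.schema("my_pipeline")  # placeholder schema name

@schema
class TempFiles(dj.Manual):
    definition = """
    # bookkeeping of temporary files (placeholder table)
    -> Table1
    file_path : varchar(255)
    """

@schema
class HeavyComputation(dj.Computed):
    definition = """
    # ...
    -> Table1
    class_label : varchar(25)
    -> Table2.proj(somekey2="somekey")
    ---
    analyzed : longblob
    """

    def make(self, key):
        analyzed = run_analysis(key)  # placeholder for the ~5-minute computation
        self.insert1({**key, "analyzed": analyzed})
        # Problematic: populate() runs make() inside a transaction, so this
        # delete on an unrelated table takes additional locks that other
        # populate() workers can collide with until the transaction commits.
        (TempFiles & key).delete()

Moving the cleanup out of make(), e.g. deleting the TempFiles entries after populate() returns, keeps that delete outside the populate transaction.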
Yes, this is a tricky problem with how transaction serialization works. I will explain in a bit more detail and provide additional background later, but the solution is to reorder the primary key attributes in the table.
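A sketch of what that reordering could look like, assuming the intent is to move class_label below the two foreign-key references so that the inherited attributes lead the primary key:

@schema
class HeavyComputation(dj.Computed):
    definition = """
    # ...
    -> Table1
    -> Table2.proj(somekey2="somekey")
    class_label : varchar(25)
    ---
    analyzed : longblob
    """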
Again, I will provide a detailed explanation later since it will take some time to write up. I did not want to make you wait.