I am using SQLAlchemy ORM. I have a Parent class and a Child class in a one-to-many relationship: a Parent has many children, potentially even 20k-30k. Here’s how my classes are set up:
from sqlalchemy import BigInteger, DateTime, ForeignKey
from sqlalchemy.orm import mapped_column, relationship


class Parent(Base):
    __tablename__ = "parent"

    id = mapped_column(BigInteger, primary_key=True, autoincrement=True)

    # associations
    children = relationship(
        "Child",
        back_populates="parent",
        uselist=True,
        cascade="all, delete-orphan",
        primaryjoin="and_(Parent.id == Child.parent_id, Child.deleted_at == None)",
    )


class Child(Base):
    __tablename__ = "child"

    id = mapped_column(BigInteger, primary_key=True, autoincrement=True)
    deleted_at = mapped_column(DateTime, nullable=True)  # referenced by the primaryjoin above

    # associations
    parent_id = mapped_column(
        BigInteger,
        ForeignKey("parent.id", ondelete="CASCADE"),
        nullable=True,
    )
    parent = relationship("Parent", back_populates="children")
The problem is that when I try to establish a one-to-many relationship between one Parent and 10k children, the flush takes 2 MINUTES when run against a real PostgreSQL database. (The database runs on a high-performance server and my laptop has an excellent connection.)
Here’s the code that I use for performance testing:
session_maker = sessionmaker(autocommit=False, autoflush=False)
session = session_maker()

parent = Parent()
session.add(parent)
session.flush()   # fast
session.commit()  # fast

children = []
for i in range(10000):
    child = Child()
    children.append(child)
session.add_all(children)
session.flush()   # fast
session.commit()  # fast

parent.children.extend(children)
for child in children:
    child.parent = parent
session.add(parent)
session.add_all(children)
session.flush()   # super slow
session.commit()  # super slow
What is weird to me is that I can create and flush 10k children in less than a second, but when I then establish a simple one-to-many relationship it takes 2 minutes.
I refuse to believe this is normal.
EDIT 1:
I changed the testing script to this and now it’s much faster:
n_children = 10_000  # added so the snippet is self-contained

session_maker = sessionmaker(autocommit=False, autoflush=False)
session = session_maker()

parent = Parent()
session.add(parent)

children = []
for _ in range(n_children):
    child = Child()
    children.append(child)
session.add_all(children)
assert len(children) == n_children

for child in children:
    child.parent = parent
parent.children.extend(children)
session.add(parent)
session.add_all(children)

# FLUSHING and COMMITTING:
start_time = time.time()
session.flush()  # fast
session.commit()
print(time.time() - start_time)

for i in range(0, n_children - 1):
    assert children[i].id + 1 == children[i + 1].id
    assert children[i].parent == parent
assert len(parent.children) == n_children
session.close()
As you can see, instead of flushing every time, I only flush once at the end. Could someone explain why this is so much faster? This will also cause some problems in my code, because the ids are not defined until the flush happens.
EDIT 2:
When establishing a one-to-many relationship between Parent and Child, we are updating the parent_id column of the child table, so we are essentially performing 10k UPDATE operations. For some reason, SQLAlchemy performs these operations sequentially, going back and forth to the database 10k times. This theory is reinforced by the fact that with a local database the execution time is much faster (but still slow), around 10 s.
Is there a way to perform the updates in a batched way? I was looking at session.bulk_update_mappings, but the execution time is still the same. Maybe there are some parameters for the engine?
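There is indeed an engine-level knob for this with the psycopg2 driver: the executemany_mode parameter of create_engine controls how executemany() calls are transmitted. This is a sketch, not a drop-in fix; the connection URL is a placeholder, and the accepted mode names vary by SQLAlchemy version, so check the psycopg2 dialect documentation for your release:

```python
from sqlalchemy import create_engine

# Placeholder URL -- substitute your real connection string.
# "values_plus_batch" asks the psycopg2 dialect to send batched
# statements via psycopg2's fast-execution helpers instead of one
# statement per row, reducing the number of network round-trips.
engine = create_engine(
    "postgresql+psycopg2://user:password@host/dbname",
    executemany_mode="values_plus_batch",
)
```

This only changes how the driver transmits the statements; the cleaner fix (see the answer below the fold) is to avoid generating the 10k UPDATEs in the first place.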
2 Answers
When you persist the child objects before they have been associated with a parent, you insert the rows into the child table with parent_id as NULL. Then, when you associate them with a parent, SQLAlchemy has to update all of those child rows with the actual parent_id. SQLAlchemy creates the parent row and then calls the DBAPI .executemany() method to update the child rows, but a Wireshark trace shows that what psycopg2 actually sends to the server is a separate UPDATE statement for each row; that is, one round-trip per child row. That will obviously be slow for a large number of children.
Instead, associate the newly-created child objects with the parent before flushing. That creates the parent row and then creates each child row with the appropriate parent_id already populated, so no follow-up UPDATEs are needed.
The two approaches do the same thing, but the first one has to execute an UPDATE for each child whereas the second one does not. If you enable echo on the engine, the output can be hard to read, but you can see the INSERTs and UPDATEs in there.