skip to Main Content

I am using SQLAlchemy ORM.
I have a Parent class and a Child class. Parent and Child are in a one-to-many relationship. A Parent has many children, potentially even 20k-30k.

Here’s how my classes are set up

class Parent(Base):
__tablename__ = "parent"

id = mapped_column(BigInteger, primary_key=True, autoincrement=True)


# associations
children = relationship(
    "Child",
    back_populates="parent",
    uselist=True,
    cascade="all, delete-orphan",
    primaryjoin='and_(Parent.id == Child.parent_id, Child.deleted_at == None)'
)


class Child(Base):
__tablename__ = "child"


id = mapped_column(BigInteger, primary_key=True, autoincrement=True)

# associations:
parent_id = mapped_column(BigInteger, ForeignKey("parent.id",
                                                       ondelete="CASCADE"),
                                            nullable=True)
parent = relationship("Parent", back_populates="children")

The problem is that when I try to establish a one-to-many relationship between one Parent and 10k children, the flushing operation takes 2 MINUTES when ran against a real instance of a PostgresSQL database. (The database is running in a high performance server and my laptop has an excellent connection.)

Here’s the code that I use for performance testing:

session_maker = sessionmaker(autocommit=False, autoflush=False)
session = session_maker()

parent = Parent()
session.add(parent)
session.flush() # fast
session.commit() # fast

children = []
for i in range(10000):
    child = Child()
    children.append(child)
session.add_all(children)
session.flush() # fast
session.commit() # fast


parent.children.extend(children)
for child in children:
    child.parent = parent
session.add(parent)
session.add_all(children)
session.flush() # super slow
session.commit() # super slow

What is weird to me is that I can create and flush 10k children in less than a second, but then when I need to establish a simple one-to-many relation it takes 2 minutes.

I refuse to believe this is normal.

EDIT 1:
I changed the testing script to this and now it’s much faster:

session_maker = sessionmaker(autocommit=False, autoflush=False)
session = session_maker()

parent = Parent()
session.add(parent)

children = []
for _ in range(n_children):
    child = Child()
    children.append(child)
session.add_all(children)

assert len(children) == n_children

for child in children:
    child.parent = parent
parent.children.extend(children)
session.add(parent)
session.add_all(children)

# FLUSHING and COMMITTING:
start_time = time.time()

session.flush() # fast
session.commit()

print(time.time() - start_time)


for i in range(0, n_children - 1):
    assert children[i].id + 1 == children[i + 1].id
    assert children[i].parent == parent
assert len(parent.children) == n_children

session.close()

As you can see instead of flushing every time, I only flush once at the end.
Could someone explain why this is much faster?
Also this will cause some problems in my code as until flushing ids are not defined.

EDIT 2:
When establishing a one-to-many relationship between Parent and Child we are updating the parent_id column of the Child table. Therefore, we are essentially trying to perform 10k UPDATE operations. For some reason, SQLAlchemy is performing this operations sequentially, therefore going back and forth to the database 10k times. This theory is reinforced by the fact that if I use a local database the execution time is much faster (but still slow), around 10s.
Is there a way to perform the updates in a parallel way? I was looking at session.bulk_update_mappings but the execution time is still the same. Maybe there are some parameters for the engine

2

Answers


  1. These do the same thing but the first one has to execute an UPDATE for each child whereas the second one does not need to do this.

    with Session(engine) as session:
        # Uses INSERT statement to create parent, gets id back with RETURNING
        parent = Parent()
        session.add(parent)
        session.flush()
    
        # Uses INSERT statement to create childs
        childs = [Child(), Child()]
        session.add_all(childs)
        session.flush()
    
        # Uses UPDATE statement to set parent_id
        parent.children.extend(childs)
        session.flush()
        session.commit()
    
    
    with Session(engine) as session:
        # Uses INSERT statement to create parent, gets id back with RETURNING
        parent = Parent()
        session.add(parent)
        session.flush()
    
        # Uses INSERT statement to create childs (with parent_id already set)
        childs = [Child(parent=parent), Child(parent=parent)]
        session.add_all(childs)
        session.flush()
    
        session.commit()
    

    Here is the echo output, it can be hard to read but you can see the inserts and updates in there.

    2023-07-26 14:28:00,947 INFO sqlalchemy.engine.Engine BEGIN (implicit)
    2023-07-26 14:28:00,952 INFO sqlalchemy.engine.Engine INSERT INTO parents DEFAULT VALUES RETURNING parents.id
    2023-07-26 14:28:00,953 INFO sqlalchemy.engine.Engine [generated in 0.00060s] {}
    2023-07-26 14:28:00,961 INFO sqlalchemy.engine.Engine INSERT INTO childs (parent_id) SELECT p0::INTEGER FROM (VALUES (%(parent_id__0)s, 0), (%(parent_id__1)s, 1)) AS imp_sen(p0, sen_counter) ORDER BY sen_counter RETURNING childs.deleted, childs.id, childs.id AS id__1
    2023-07-26 14:28:00,962 INFO sqlalchemy.engine.Engine [generated in 0.00029s (insertmanyvalues) 1/1 (ordered)] {'parent_id__0': None, 'parent_id__1': None}
    2023-07-26 14:28:00,980 INFO sqlalchemy.engine.Engine SELECT childs.deleted AS childs_deleted, childs.id AS childs_id, childs.parent_id AS childs_parent_id 
    FROM childs 
    WHERE %(param_1)s = childs.parent_id AND childs.deleted IS NULL
    2023-07-26 14:28:00,980 INFO sqlalchemy.engine.Engine [generated in 0.00070s] {'param_1': 1}
    2023-07-26 14:28:00,989 INFO sqlalchemy.engine.Engine UPDATE childs SET parent_id=%(parent_id)s WHERE childs.id = %(childs_id)s
    2023-07-26 14:28:00,990 INFO sqlalchemy.engine.Engine [generated in 0.00062s] [{'parent_id': 1, 'childs_id': 1}, {'parent_id': 1, 'childs_id': 2}]
    2023-07-26 14:28:00,993 INFO sqlalchemy.engine.Engine COMMIT
    2023-07-26 14:28:01,025 INFO sqlalchemy.engine.Engine BEGIN (implicit)
    2023-07-26 14:28:01,026 INFO sqlalchemy.engine.Engine INSERT INTO parents DEFAULT VALUES RETURNING parents.id
    2023-07-26 14:28:01,026 INFO sqlalchemy.engine.Engine [cached since 0.07392s ago] {}
    2023-07-26 14:28:01,032 INFO sqlalchemy.engine.Engine INSERT INTO childs (parent_id) SELECT p0::INTEGER FROM (VALUES (%(parent_id__0)s, 0), (%(parent_id__1)s, 1)) AS imp_sen(p0, sen_counter) ORDER BY sen_counter RETURNING childs.deleted, childs.id, childs.id AS id__1
    2023-07-26 14:28:01,032 INFO sqlalchemy.engine.Engine [cached since 0.07063s ago (insertmanyvalues) 1/1 (ordered)] {'parent_id__0': 2, 'parent_id__1': 2}
    2023-07-26 14:28:01,035 INFO sqlalchemy.engine.Engine COMMIT
    
    Login or Signup to reply.
  2. When you persist the child objects before they have been associated with a parent, you insert the rows into the child table with parent_id as NULL. That is,

    num_children = 3
    
    engine.echo = True
    t0 = time.perf_counter()
    with Session(engine) as sess:
        all_children = [Child() for i in range(num_children)]
        sess.add_all(all_children)
        sess.flush()
    

    generates

    INSERT INTO child (parent_id) SELECT p0::INTEGER FROM (VALUES (%(parent_id__0)s, 0), (%(parent_id__1)s, 1), (%(parent_id__2)s, 2)) AS imp_sen(p0, sen_counter) ORDER BY sen_counter RETURNING child.id, child.id AS id__1
    [generated in 0.00014s (insertmanyvalues) 1/1 (ordered)] {'parent_id__0': None, 'parent_id__1': None, 'parent_id__2': None}
    

    Then, when you associate them with a parent, SQLAlchemy has to update all of those child rows with actual parent_id.

        sess.add(
            Parent(name="parent_a", children=all_children)
        )
        sess.commit()
    

    SQLAlchemy creates the parent row

    INSERT INTO parent (name) VALUES (%(name)s) RETURNING parent.id
    [generated in 0.00045s] {'name': 'parent_a'}
    

    and then calls the DBAPI .executemany() method to update the child rows

    UPDATE child SET parent_id=%(parent_id)s WHERE child.id = %(child_id)s
    [generated in 0.00038s] [{'parent_id': 1, 'child_id': 1}, {'parent_id': 1, 'child_id': 2}, {'parent_id': 1, 'child_id': 3}]
    

    but a Wireshark trace shows that what psycopg2 actually sends to the server are separate UPDATE statements for each row

    UPDATE child SET parent_id=1 WHERE child.id = 1
    UPDATE child SET parent_id=1 WHERE child.id = 2
    UPDATE child SET parent_id=1 WHERE child.id = 3
    

    that is, one round-trip for each child row. That will obviously be slow for a large number of children.

    Instead, associate the newly-created child objects to the parent before flushing, which creates the parent and then creates each child with the appropriate parent_id

    with Session(engine) as sess:
        all_children = [Child() for i in range(num_children)]
        sess.add(
            Parent(name="parent_a", children=all_children)
        )
        sess.commit()
    

    which generates

    INSERT INTO parent (name) VALUES (%(name)s) RETURNING parent.id
    [generated in 0.00045s] {'name': 'parent_a'}
    INSERT INTO child (parent_id) SELECT p0::INTEGER FROM (VALUES (%(parent_id__0)s, 0), (%(parent_id__1)s, 1), (%(parent_id__2)s, 2)) AS imp_sen(p0, sen_counter) ORDER BY sen_counter RETURNING child.id, child.id AS id__1
    [generated in 0.00017s (insertmanyvalues) 1/1 (ordered)] {'parent_id__0': 1, 'parent_id__1': 1, 'parent_id__2': 1}
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search