I’m trying to execute the following query using sqlalchemy:
query = session.query(
func.count(ExampleTable.id)
).filter(ExampleTable.environment == self.environment))
if ids_list:
query = query.filter(ExampleTable.id.in_(ids_list))
I noticed it was slow, so i did a simple timing "benchmark":
start = perf_counter()
query.first()
print(format(perf_counter() - start, '.3f')) # .3f to limit decimal cases
- It does not have to be precise, it is just for debugging purposes.
The average result of this perf_counter is always close to 2.7 seconds.
I also thought it could be the ExampleTable.environment == self.environment
, since it is a string comparison, but the average result was around 0.03s, executing the same "benchmark" above with only the environment
comparison.
ExampleTable
has 150k ~ 200k total entries.ids_list
has 22k integer values.
With all that in mind, i did the same queries on MySQL Workbench (using the __ str __ method from Query object):
SELECT
COUNT(example_table.id) AS count_1
FROM
example_table
WHERE
example_table.environment = 'production'
AND example_table.id IN (775008, 790152, 793346, ...) -- 22k ids
The first execution, which is usually slower, took 0.110 sec / 0.000 sec
(duration / fetch)
The next executions took around 0.030 ~ 0.040 / 0.000
(duration / fetch)
SELECT
COUNT(example_table.id) AS count_1
FROM
example_table
WHERE
example_table.environment = 'production';
First execution: 0.016 sec / 0.000 sec
(duration / fetch)
Next executions: 0.000 / 0.000
(duration / fetch)
Since i’m running the same query on sqlalchemy and workbench, the duration should be at least close, right?
The difference between 2.7s~ to 0.100s~ is a LOT.
- Am i doing something wrong here? On python or mysql
- If the question lacks information, please let me know.
Edit1: The echo=True
shows not exactly what i’d expect:
This is the first log shown about the query i’m debugging:
2024-07-15 12:13:58,594 INFO sqlalchemy.engine.Engine SELECT count(example_table.id) AS count_1
FROM example_table
WHERE example_table.environment = %(environment_1)s AND example_table.id IN (%(id_1_1)s, %(id_1_2)s, %(id_1_3)s, ..., (id_1_21800)s, %(id_1_21801)s)
LIMIT %(param_1)s
All these references seem odd %(id_1_1)s, %(id_1_2)s, ...
And aparently the second one defines the references:
2024-07-15 12:46:10,584 INFO sqlalchemy.engine.Engine [generated in 0.01642s] {'id_1': 1, 'environment_1': 'production', 'param_1': 1, 'id_1_1': 335469, 'id_1_2': 412304, 'id_1_3': 468745, ..., 'id_1_21800': 365095, 'id_1_21801': 731841}
Could be this slowness from these defined variables? Its one per id
existing in .in_
clause
2
Answers
The problem was not the
.in_
itself, but with the engine url i was using:Just changed it to
mysql
instead ofmysql+mysqlconnector
This made the query time drop from
~2.7s
to~0.09s
I agree even taking into account any python string interpolation or something 2.7 seconds is a very large magnitude slower, 27 times.
A few things to try:
echo=True
is on for the engine and check that the SQL being submitted is actually what you expect. (ie. a single query and not multiple queries, no other queries are firing, the SQL looks like what you expect)text()
and pass it to sqlalchemy