I have a query over two tables — matchoverview
id, home_id, away_id, date, season, result
matchattributes
id, match_id, attribute_id, attribute_value
My query
select m.id from matchOverview m
join matchAttributes ma on ma.match_id=m.id and ma.attribute_id in (3,4,5,6)
group by m.id
having sum(case when ma.attribute_id in (3,4)
then ma.attribute_value end) > 3
or sum(case when ma.attribute_id in (5,6)
then ma.attribute_value end) > 3;
This returns all match ids where the sum of attributes 3 and 4, or of attributes 5 and 6, is greater than 3.
This particular query returns about 900k rows. Unsurprisingly, phpMyAdmin takes a while to display them, as I imagine it needs to format the results into a table, but it clocks the query itself at 0.0113 seconds.
Yet when I run this query from PHP it takes 15 seconds. If I alter the query with a LIMIT of only 100 results, it runs almost instantly, leaving me with the belief that the amount of data being transferred is what is slowing it down.
But would it really take 15 seconds to transfer 1M 4 byte ints over the network?
Is the only solution to further limit the query so that it returns less results?
EDIT
Results of an EXPLAIN on my query
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE m index PRIMARY PRIMARY 4 NULL 2790717 Using index
1 SIMPLE ma ref match,attribute match 4 opta_matches2.m.id 2 Using where
How I am timing my SQL query
$time_pre = microtime(true);           // timestamp just before issuing the query
$quer = $db->query($sql);              // execute the statement
$time_post = microtime(true);          // timestamp once query() returns
$exec_time = $time_post - $time_pre;   // elapsed seconds reported as the query time
Data from slow query log
# Thread_id: 15 Schema: opta_matches2 QC_hit: No
# Query_time: 15.594386 Lock_time: 0.000089 Rows_sent: 923962 Rows_examined: 15688514
# Rows_affected: 0 Bytes_sent: 10726615
I am OK with dealing with a 15-second query if that is simply how long it takes the data to move over the network, but if the query or my table can be optimized, that is the best solution.
The row count is not the issue; the following query
select m.id from matchOverview m
join matchAttributes ma on ma.match_id=m.id and ma.attribute_id in (1,2,3,4)
group by m.id
having sum(case when ma.attribute_id in (3,4)
then ma.attribute_value end) > 8
and sum(case when ma.attribute_id in (1,2)
then ma.attribute_value end) = 0;
returns only 24 rows, but also takes ~15 seconds.
3 Answers
phpMyAdmin doesn’t give you all the results; it applies a default LIMIT of 25 rows.
If you change this limit via the "Number of rows" select box, or type a LIMIT into the query yourself, the query will take more time to run.
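For illustration, the first results page in phpMyAdmin effectively runs something like this (a sketch; the exact offset and row count depend on the "Number of rows" setting):

select m.id from matchOverview m
join matchAttributes ma on ma.match_id=m.id and ma.attribute_id in (3,4,5,6)
group by m.id
having sum(case when ma.attribute_id in (3,4) then ma.attribute_value end) > 3
    or sum(case when ma.attribute_id in (5,6) then ma.attribute_value end) > 3
limit 0, 25;  -- appended by phpMyAdmin, so only the first 25 ids are produced and shipped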
I think if you rewrote the conditions, at a minimum you might find something out. For instance, I think the following does the same as the second example (the 24-result one):
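One way to do that, as a sketch (the derived-table form and the sum_34/sum_12 aliases are introduced here purely for illustration):

select t.id
from (
    select m.id,
           sum(case when ma.attribute_id in (3,4) then ma.attribute_value end) as sum_34,
           sum(case when ma.attribute_id in (1,2) then ma.attribute_value end) as sum_12
    from matchOverview m
    join matchAttributes ma on ma.match_id = m.id and ma.attribute_id in (1,2,3,4)
    group by m.id
) as t
where t.sum_34 > 8
  and t.sum_12 = 0;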
It’s more verbose, but it could help triangulate where the bottleneck(s) come from more readily.
For instance, if a (working) version of the above is still slow, then run the inner query with the GROUP BY intact. Still slow? Remove the GROUP BY. Move the GROUP BY/SUM into the outer query; what happens? That kind of thing. I can’t run it, so I can’t work out a more precise answer, which I would like to know.
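As a sketch, the "move the GROUP BY/SUM into the outer query" variant applied to the second example might look like this (the inner query only gathers the rows; the aggregation happens outside):

select t.id
from (
    select m.id, ma.attribute_id, ma.attribute_value
    from matchOverview m
    join matchAttributes ma on ma.match_id = m.id and ma.attribute_id in (1,2,3,4)
) as t
group by t.id
having sum(case when t.attribute_id in (3,4) then t.attribute_value end) > 8
   and sum(case when t.attribute_id in (1,2) then t.attribute_value end) = 0;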
There are probably two significant parts to the timing: Locate the rows and decide which ids to send; then send them. I will address both.
Here’s a way to better separate the elapsed time for just the query (and not the network):
SELECT COUNT(*) FROM (...) AS x;
where ‘…’ is the 1M-row query.
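With the original 900K-row query dropped in, that looks like the following; the COUNT(*) wrapper keeps the ids on the server, so the timing reflects only the query work, not the transfer:

SELECT COUNT(*) FROM (
    select m.id from matchOverview m
    join matchAttributes ma on ma.match_id=m.id and ma.attribute_id in (3,4,5,6)
    group by m.id
    having sum(case when ma.attribute_id in (3,4) then ma.attribute_value end) > 3
        or sum(case when ma.attribute_id in (5,6) then ma.attribute_value end) > 3
) AS x;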
Speeding up the query

Since you aren’t really using matchoverview, let’s get rid of it, and have a composite index on matchAttributes with the columns in a suitable order; a sketch of both follows.
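A sketch of both (the index name and column order here are guesses, and dropping the join assumes every matchAttributes row references an existing match):

-- Same ids without touching matchoverview, since only the match id is selected:
select ma.match_id
from matchAttributes ma
where ma.attribute_id in (3,4,5,6)
group by ma.match_id
having sum(case when ma.attribute_id in (3,4) then ma.attribute_value end) > 3
    or sum(case when ma.attribute_id in (5,6) then ma.attribute_value end) > 3;

-- A covering composite index so the query can be answered from the index alone
-- (one plausible column order; the optimizer may prefer another):
ALTER TABLE matchAttributes
    ADD INDEX idx_attr_match_value (attribute_id, match_id, attribute_value);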
As for the speedy LIMIT, that is because it can stop short. But a LIMIT without an ORDER BY is rather meaningless. If you add an ORDER BY, it will have to gather all the results, sort them, and finally perform the LIMIT.
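For instance (a sketch based on the original query):

-- Without ORDER BY, the server can stop as soon as 100 qualifying groups are found:
select m.id from matchOverview m
join matchAttributes ma on ma.match_id=m.id and ma.attribute_id in (3,4,5,6)
group by m.id
having sum(case when ma.attribute_id in (3,4) then ma.attribute_value end) > 3
    or sum(case when ma.attribute_id in (5,6) then ma.attribute_value end) > 3
limit 100;

-- Adding an ORDER BY (on a column the index cannot deliver in order) forces all
-- ~900K qualifying ids to be gathered and sorted before the first 100 are returned.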
Network transfer time
Transferring millions of rows (I see 10.7MB in the slowlog) over the network is time-consuming, but takes virtually no CPU time.
One EXPLAIN implies that there might be 2.8M rows; is that about correct? The slowlog says that about 16M rows are touched; this may be because of the two tables, join, group by, etc. My reformulation and index should decrease the 16M significantly, hence decrease the elapsed time (before the network transfer time).

923K rows "sent": what will the client do with that many rows? In general, I find that more than a few thousand rows "sent" indicates poor design.
"take 15 seconds to transfer 1M 4 byte ints over the network" — That is elapsed time, and cannot be sped up except by sending fewer rows. (BTW, it is probably sent as strings of several digits, plus overhead for each row; I don’t whether the 10726615 is actual network bytes or counts only the ints.)
"the ids are used in an internal calculation" — How do you calculate with ids? If you are looking up the ids in some other place, perhaps you can add complexity to the query, thereby doing more work before hitting the network; then shipping less data?
If you want to discuss this further, please provide SHOW CREATE TABLE. (It may have some details that don’t show up in your simplified table definition.)