I have a query that I need to run multiple times during specific time frames. My implementation works, but my user base is growing every day, and both the CPU load on my database and the time taken to execute the query keep increasing.
Here is the query:
SELECT bill.* FROM billing bill
INNER JOIN subscriber s ON (s.subscriber_id = bill.subscriber_id)
INNER JOIN subscription sub ON (s.subscriber_id = sub.subscriber_id)
WHERE s.status = 'C'
AND bill.subscription_id = sub.subscription_id
AND sub.renewable = 1
AND (hour(sub.created_at) > 1 AND hour(sub.created_at) < 5)
AND sub.store = 'BizaoStore'
AND (sub.purchase_token = 'myservice' OR sub.purchase_token = 'myservice_wait')
AND bill.billing_date > '2022-12-31 07:00:00' AND bill.billing_date < '2023-01-01 10:00:00'
AND (bill.billing_value = 'not_ok bizao_tobe' OR bill.billing_value = 'not_ok BILL010 2' OR bill.billing_value = 'not_ok BILL010' OR bill.billing_value = 'not_ok BILL010 3')
AND (SELECT MAX(bill2.billing_date)
     FROM billing bill2
     WHERE bill2.subscriber_id = bill.subscriber_id
     AND bill2.subscription_id = bill.subscription_id
     AND bill2.billing_value = 'not_ok bizao_tobe') = bill.billing_date
ORDER BY sub.created_at DESC LIMIT 300;
This query is executed on two different servers; each server handles a specific service. On each server, the query runs 8 times per minute (for around 3 hours), and each of the 8 runs uses this line with different hours:
AND (hour(sub.created_at) > 1 AND hour(sub.created_at) < 5)
I did this so I can split my user base into 8 segments and process the requests more efficiently. I also need to handle only 300 users at a time, because the third-party server I have to call for each user is not very stable and can sometimes take a very long time to respond.
The billing table counts around 50,000,000 entries; here is the schema of the columns and indexes:
The subscriber table is around 2,000,000 rows; column schema and indexes:
And finally the subscription table, 2,500,000 rows; schema and indexes:
One more piece of information: during my optimization tests, I noticed that if I add a condition requiring billing_id to be above a specific ID, the query runs very fast. So I think what is taking the most time is scanning the 50,000,000-row table.
I have tried over time to optimize this query to be more efficient, but as of now I'm a little bit stuck with it.
MySQL version is 5.7.38.
Thanks for your help
2 Answers
I think there is some room for improvement; some changes are easier to make, and some take more effort:

First, you filter on

AND (hour(sub.created_at) > 1 AND hour(sub.created_at) < 5)

however, sub.created_at does not have an index. It is a good idea for created_at and updated_at columns to always have indexes in the database.

Second, you compare bill.billing_date against a correlated subquery computing MAX(bill2.billing_date). This can be a bottleneck too, as this subquery needs to run for each row of the previous stage of execution (you can check that with an EXPLAIN of the query). Depending on your constraints, you can store that max value somewhere (a column in the database, a cache, etc.) and keep updating it every time the underlying table's values change. You can do that with database triggers (which some people consider an anti-pattern) or inside a transaction in your code.

Having said all that, the first point might improve performance a lot, making the rest unnecessary.
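If you do go the trigger route from the second point, a minimal sketch might look like the following. The column latest_tobe_billing_date is hypothetical (not in the original schema), and only the insert trigger is shown for brevity; updates and deletes would need similar handling:

-- Hypothetical denormalized column holding the latest 'not_ok bizao_tobe' billing date
ALTER TABLE subscription ADD COLUMN latest_tobe_billing_date DATETIME NULL;

-- Keep the column in sync on every insert into billing
CREATE TRIGGER billing_after_insert
AFTER INSERT ON billing
FOR EACH ROW
  UPDATE subscription
     SET latest_tobe_billing_date = NEW.billing_date
   WHERE subscription_id = NEW.subscription_id
     AND subscriber_id = NEW.subscriber_id
     AND NEW.billing_value = 'not_ok bizao_tobe'
     AND (latest_tobe_billing_date IS NULL
          OR latest_tobe_billing_date < NEW.billing_date);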
I see a few opportunities to speed up this query. (Reference: Markus Winand's https://use-the-index-luke.com/ e-book.) They are: replacing your correlated subquery (SELECT MAX(bill2.billing_date)...) with an independent subquery, and making your WHERE clauses sargable (able to exploit indexes).

Independent subquery: Obtain the latest billing date for each subscriber / subscription as sketched below. That query need only run once, whereas the correlated subquery you have runs many times.
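A sketch of that independent subquery, built from the same columns and constant as your correlated version:

SELECT subscriber_id, subscription_id,
       MAX(billing_date) AS latest_billing_date
FROM billing
WHERE billing_value = 'not_ok bizao_tobe'
GROUP BY subscriber_id, subscription_id;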
Use an index like the one sketched below to speed up that subquery. Such an index allows the subquery to be satisfied with an almost miraculously fast loose index scan.
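One possible shape (the index name is arbitrary): the equality-tested billing_value leads, then the two grouping columns, then billing_date for the MAX().

ALTER TABLE billing
  ADD INDEX billing_value_latest
      (billing_value, subscriber_id, subscription_id, billing_date);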
Then rewrite your overall query to use it, as sketched below. I have rewritten a few other things there as well to improve readability: mostly changing col = a OR col = b OR col = c to col IN (a, b, c). I also changed the order of some WHERE clauses, again for readability; the order of WHERE clauses doesn't matter to performance.
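A sketch of the rewritten query; the JOIN conditions on the derived table mirror the correlation in your original subquery:

SELECT bill.*
FROM billing bill
INNER JOIN (
    SELECT subscriber_id, subscription_id,
           MAX(billing_date) AS latest_billing_date
    FROM billing
    WHERE billing_value = 'not_ok bizao_tobe'
    GROUP BY subscriber_id, subscription_id
) latest ON latest.subscriber_id = bill.subscriber_id
        AND latest.subscription_id = bill.subscription_id
        AND latest.latest_billing_date = bill.billing_date
INNER JOIN subscriber s ON s.subscriber_id = bill.subscriber_id
INNER JOIN subscription sub ON sub.subscriber_id = bill.subscriber_id
                           AND sub.subscription_id = bill.subscription_id
WHERE s.status = 'C'
  AND sub.renewable = 1
  AND sub.store = 'BizaoStore'
  AND sub.purchase_token IN ('myservice', 'myservice_wait')
  AND hour(sub.created_at) > 1 AND hour(sub.created_at) < 5  -- made sargable below
  AND bill.billing_value IN ('not_ok bizao_tobe', 'not_ok BILL010',
                             'not_ok BILL010 2', 'not_ok BILL010 3')
  AND bill.billing_date > '2022-12-31 07:00:00'
  AND bill.billing_date < '2023-01-01 10:00:00'
ORDER BY sub.created_at DESC
LIMIT 300;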
Sargability: Your segmenting of your user base by hours of the day means you need this clause, as you pointed out:

AND (hour(sub.created_at) > 1 AND hour(sub.created_at) < 5)

This clause applies the HOUR() function to every eligible row, so MySQL has to scan through all those rows looking at each one. Slow. Instead, add a virtual column called created_hour to your subscription table (we'll put an index on the column in a moment), then start using the virtual column to segment your users. Both steps are sketched below.
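A sketch, relying on MySQL 5.7's generated columns; the column name created_hour is our choice:

-- virtual column derived from created_at; VIRTUAL generated columns can still be indexed in 5.7
ALTER TABLE subscription
  ADD COLUMN created_hour TINYINT
      GENERATED ALWAYS AS (HOUR(created_at)) VIRTUAL;

-- then segment users with the sargable column instead of HOUR():
AND sub.created_hour > 1 AND sub.created_hour < 5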
Indexes: Compound (multicolumn) indexes are the way to speed up complex queries like yours. The order of columns matters in indexes: the columns with equality matches go first, then a column with a range match.
First, let's put a compound index on your subscription table that matches the requirements of your query; we'll index our new virtual column in the process. This lets the query planner find your batches efficiently. The last column in this index is created_at, which speeds up your ORDER BY ... LIMIT operation. One possible definition is sketched below.
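A sketch; the exact column order is a judgment call (equality-matched columns first, ranged ones later), so verify the plan with EXPLAIN:

ALTER TABLE subscription
  ADD INDEX renewable_store_token_hour_created
      (renewable, store, purchase_token, created_hour, created_at);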
Next, let's look at how you use billing in your main query. (We already added an index to help the subquery.) You match for equality on billing_value and then by a date range on billing_date, so you need an index like the one below.
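A sketch of that index, with the equality column leading and the range column second; the name is arbitrary:

ALTER TABLE billing
  ADD INDEX billing_value_date (billing_value, billing_date);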
You already have the index you need on subscriber.