I am currently working on a project that creates a word cloud for the NYT. To that end, I scrape all the articles from the NYT via their API and store each article's date, its href, and all the tokens mentioned inside that article together with their amounts in a MySQL database.
The model looks like the following:
Date (dateID PK, publish_date Date)
Article (articleID PK, href char(300), dateID FK)
Token (tokenID PK, name char(100), amount, articleID FK)
I have also indexed publish_date and the token name, which has already made the query roughly twice as fast.
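For reference, a minimal DDL sketch of that model (the integer types, the amount column's type, and the index names are my assumptions; everything else is as described above):

-- Sketch of the schema as described; types and index names are assumptions
CREATE TABLE date (
    dateID INT PRIMARY KEY AUTO_INCREMENT,
    publish_date DATE
);

CREATE TABLE article (
    articleID INT PRIMARY KEY AUTO_INCREMENT,
    href CHAR(300),
    dateID INT,
    FOREIGN KEY (dateID) REFERENCES date (dateID)
);

CREATE TABLE token (
    tokenID INT PRIMARY KEY AUTO_INCREMENT,
    name CHAR(100),
    amount INT,          -- per-article occurrence count
    articleID INT,
    FOREIGN KEY (articleID) REFERENCES article (articleID)
);

-- The two single-column indexes mentioned above
CREATE INDEX date_publish_date ON date (publish_date);
CREATE INDEX token_name ON token (name);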
The job of the query is to return up to 100 tokens with their total amount across all articles published within a specified range of dates. It looks like this:
WITH
dateID AS (
    SELECT dateID
    FROM date
    WHERE publish_date >= '1999-12-31'
      AND publish_date <= '2023-12-31'
),
articleIDs AS (
    SELECT articleID
    FROM article
    WHERE dateID IN (TABLE dateID)
)
SELECT t.name AS name, SUM(t.amount) AS amount
FROM token t
WHERE t.articleID IN (TABLE articleIDs)
  AND t.name > '@'
GROUP BY t.name
ORDER BY amount DESC
LIMIT 100;
FYI: it's InnoDB, hosted on AWS, MySQL version 8.0.33.
I have already limited the timeframe fetched from the NYT; right now the years 2012–2023 are loaded into the database.
With the current setup, the query takes around 100 seconds to complete with only 6,458,501 tokens. I have found that the most inefficient part is the GROUP BY, so I looked into optimizing it via an index scan, but found that this would not work for my use case.
How can I make this as fast as possible? My goal would be something around 5 seconds.
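For anyone who wants to see where the time goes: MySQL 8.0.18 and later can report measured per-step timings if you prefix the query with EXPLAIN ANALYZE, e.g.:

-- EXPLAIN ANALYZE executes the query and prints actual time spent in each plan step
EXPLAIN ANALYZE
WITH
dateID AS (
    SELECT dateID
    FROM date
    WHERE publish_date >= '1999-12-31'
      AND publish_date <= '2023-12-31'
),
articleIDs AS (
    SELECT articleID
    FROM article
    WHERE dateID IN (TABLE dateID)
)
SELECT t.name AS name, SUM(t.amount) AS amount
FROM token t
WHERE t.articleID IN (TABLE articleIDs)
  AND t.name > '@'
GROUP BY t.name
ORDER BY amount DESC
LIMIT 100;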
EDIT: Execution plan:
Answers
Sixty thousand rows a second is not pathologically slow for a query like this. If your application needs this result set frequently, you'll need to keep a copy of it somewhere. For what it's worth, the top 100 tokens used in almost a quarter century of historical news items will change very slowly, and your users will not notice if you recompute this result set only once a day, or even once a week.
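A minimal sketch of that approach, with assumed table and column names (add the date-range join from your query to the SELECT if the cloud is range-limited):

-- Hypothetical summary table holding the precomputed top-100 result
CREATE TABLE top_tokens (
    name   CHAR(100) NOT NULL,
    amount BIGINT    NOT NULL
);

-- Nightly (or weekly) refresh, run from cron or a MySQL scheduled event
TRUNCATE TABLE top_tokens;
INSERT INTO top_tokens (name, amount)
SELECT t.name, SUM(t.amount) AS amount
FROM token t
WHERE t.name > '@'
GROUP BY t.name
ORDER BY amount DESC
LIMIT 100;

-- The application then reads the cached copy instantly:
SELECT name, amount FROM top_tokens ORDER BY amount DESC;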
If you want this to be more efficient, a good next step will be to eliminate your date table entirely and store each publish_date in the article table instead. Then your query will look like this.
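A sketch of what that query could look like once publish_date lives on article (same predicates as the question's query):

SELECT t.name AS name, SUM(t.amount) AS amount
FROM token t
JOIN article a ON a.articleID = t.articleID
WHERE a.publish_date >= '1999-12-31'
  AND a.publish_date <= '2023-12-31'
  AND t.name > '@'
GROUP BY t.name
ORDER BY amount DESC
LIMIT 100;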
There's nothing to be gained by referring to your dates with dateID values the way you do it; SQL is good at filtering by ranges of dates. Even if you do need a separate table of dates for some reason you haven't mentioned, use its date values as its primary key rather than a dateID.

An index on article.publish_date will help. And you should try these two compound indexes on your token table: one of them will help, and the query plan will tell you which.
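For example, something like the following (the exact column orders are an assumption; each one covers the query, so InnoDB can satisfy it from the index alone):

-- Candidate 1: filter by article first, then group by name
CREATE INDEX token_article_name_amount ON token (articleID, name, amount);
-- Candidate 2: walk names in order, probing matching articles per name
CREATE INDEX token_name_article_amount ON token (name, articleID, amount);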
Have you tried giving the database more RAM? Maybe download more?

Btw: just use MongoDB!