
I am currently working on a project that builds a word cloud for the NYT. To do this, I scrape the articles from the NYT API and then store each article's publish date, its href, and all the tokens mentioned in that article together with their counts in a MySQL database.
The model looks like the following:

Date (dateID PK, publish_date Date)
Article (articleID PK, href char(300), dateID FK)
Token (tokenID PK, name char(100), amount, articleID FK)
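
Roughly, the DDL looks like this (exact types and constraints may differ slightly; the amount column is the per-article count used in the query below):

-- Sketch of the schema as described; types and constraints are approximate.
CREATE TABLE date (
    dateID       INT AUTO_INCREMENT PRIMARY KEY,
    publish_date DATE NOT NULL
);

CREATE TABLE article (
    articleID INT AUTO_INCREMENT PRIMARY KEY,
    href      VARCHAR(300) NOT NULL,   -- VARCHAR here, since CHAR is capped at 255 in MySQL
    dateID    INT NOT NULL,
    FOREIGN KEY (dateID) REFERENCES date (dateID)
);

CREATE TABLE token (
    tokenID   INT AUTO_INCREMENT PRIMARY KEY,
    name      CHAR(100) NOT NULL,
    amount    INT NOT NULL,            -- how often the token appears in the article
    articleID INT NOT NULL,
    FOREIGN KEY (articleID) REFERENCES article (articleID)
);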

I have also indexed publish_date and the token name, which has already made the query roughly twice as fast.
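
Those two indexes are essentially the following (the index names are placeholders):

CREATE INDEX idx_date_publish_date ON date (publish_date);
CREATE INDEX idx_token_name ON token (name);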

The job of the query is to return up to 100 tokens with their total amounts across all articles published within a specified range of dates. It looks like this:

WITH 
    dateID as (
        SELECT dateID
        FROM date 
        WHERE publish_date >= '1999-12-31'
          AND publish_date <= '2023-12-31'
    ),
    articleIDs as (
        SELECT articleID
        FROM article
        WHERE dateID IN (TABLE dateID)
    )
SELECT t.name as name, sum(t.amount) as amount
FROM token t
WHERE t.articleID IN (TABLE articleIDs)
AND t.name > '@'
GROUP BY t.name
ORDER BY amount DESC
LIMIT 100;

FYI: it's InnoDB hosted on AWS, MySQL version 8.0.33.

I have already limited the timeframe fetched from the NYT; right now the years 2012–2023 are loaded into the database.
With the current setup, the query takes around 100 seconds to complete with only 6,458,501 tokens. I have found that the most expensive part is the GROUP BY, so I looked into optimizing it via an index scan, but found that this would not work for my use case.

How can I make this as fast as possible? My goal would be something around 5 seconds.

EDIT: Execution plan:
[execution plan screenshot]
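
For anyone who wants to reproduce it in text form, EXPLAIN ANALYZE (available since MySQL 8.0.18) runs the query and reports actual row counts and timings per plan step:

EXPLAIN ANALYZE
WITH
    dateID as (
        SELECT dateID
        FROM date
        WHERE publish_date >= '1999-12-31'
          AND publish_date <= '2023-12-31'
    ),
    articleIDs as (
        SELECT articleID
        FROM article
        WHERE dateID IN (TABLE dateID)
    )
SELECT t.name as name, sum(t.amount) as amount
FROM token t
WHERE t.articleID IN (TABLE articleIDs)
AND t.name > '@'
GROUP BY t.name
ORDER BY amount DESC
LIMIT 100;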

2 Answers


  1. Sixty thousand rows a second is not pathologically slow for a query like this. If your application needs this result set frequently, you’ll need to keep a copy of it somewhere. For what it’s worth, the top 100 tokens used across almost a quarter century of historical news items will change very slowly, and your users will not notice if you recompute this result set only once a day, or even once a week.
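
    A minimal sketch of that caching approach, assuming a helper table called token_top100 (the table name and the refresh mechanism are placeholders, not something prescribed by the answer):

    -- One-time setup: a small table that holds the precomputed result.
    CREATE TABLE token_top100 (
        name   CHAR(100) PRIMARY KEY,
        amount BIGINT NOT NULL
    );

    -- Refresh step, run once a day or week (e.g. from a cron job or a MySQL EVENT).
    -- Written with plain joins; it is equivalent to the CTE form of the original query.
    TRUNCATE TABLE token_top100;
    INSERT INTO token_top100 (name, amount)
    SELECT t.name, SUM(t.amount) AS amount
    FROM token t
    JOIN article a ON a.articleID = t.articleID
    JOIN date d    ON d.dateID    = a.dateID
    WHERE d.publish_date >= '1999-12-31'
      AND d.publish_date <= '2023-12-31'
      AND t.name > '@'
    GROUP BY t.name
    ORDER BY amount DESC
    LIMIT 100;

    The application then just runs SELECT * FROM token_top100 ORDER BY amount DESC, which returns its at most 100 rows instantly.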

    If you want this to be more efficient, a good next step will be to eliminate your date table entirely, and store each publish_date in the article table instead. Then your query will look like this.

    WITH 
        articleIDs as (
            SELECT articleID
            FROM article
            WHERE publish_date >= '1999-12-31'
              AND publish_date <= '2023-12-31'
        )
    SELECT t.name as name, sum(t.amount) as amount
    FROM token t
    WHERE t.articleID IN (TABLE articleIDs)
    AND t.name > '@'
    GROUP BY t.name
    ORDER BY amount DESC
    LIMIT 100;
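
    If you go that route, the migration itself is small; something like this works (the foreign-key constraint name is a placeholder, check SHOW CREATE TABLE article for the real one):

    -- Move the date onto article and backfill it from the old date table.
    ALTER TABLE article ADD COLUMN publish_date DATE;

    UPDATE article a
    JOIN date d ON d.dateID = a.dateID
    SET a.publish_date = d.publish_date;

    -- Then drop the old link; replace article_ibfk_1 with the real constraint name.
    ALTER TABLE article DROP FOREIGN KEY article_ibfk_1;
    ALTER TABLE article DROP COLUMN dateID;
    DROP TABLE date;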
    

    There’s nothing to be gained by referring to your dates with dateID values the way you do it. SQL’s good at filtering by ranges of dates. Even if you do need a separate table of dates for some reason you haven’t mentioned, use its date values as its primary key rather than a dateID.

    An index on article.publish_date will help. You should also try these two compound indexes on your token table; one of them will help, and the query plan will tell you which.

    CREATE INDEX name_id ON token(name, articleID);
    CREATE INDEX id_name ON token(articleID, name);
    
  2. Have you tried giving the database more RAM? Maybe download more?
    By the way: just use MongoDB!
