skip to Main Content

I am having performance issues with a query, I have 21 million records across the table, and 2 of the tables I’m looking in here have 8 million each; individually, they are very quick. But I’ve done a query that, in my opinion, isn’t very good, but it’s the only way I know how to do it.

This query takes 65 seconds, I need to get it under 1 second and I think it’s possible if I don’t have all the SELECT queries, but once again, I am not sure how else to do it with my SQL knowledge.

Database server version is MariaDB 10.6.

SELECT
pa.`slug`,
(
    SELECT 
    SUM(`impressions`) 
    FROM `rh_pages_gsc_country` 
    WHERE `page_id` = pa.`page_id` 
    AND `country` = 'aus'
    AND `date_id` IN 
        (
            SELECT `date_id` 
            FROM `rh_pages_gsc_dates` 
            WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
        )
) as au_impressions,
(
    SELECT 
    SUM(`clicks`) 
    FROM `rh_pages_gsc_country` 
    WHERE `page_id` = pa.`page_id` 
    AND `country` = 'aus'
    AND `date_id` IN 
        (
            SELECT `date_id` 
            FROM `rh_pages_gsc_dates` 
            WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
        )
) as au_clicks,
(
    SELECT 
    COUNT(`keywords_id`) 
    FROM `rh_pages_gsc_keywords` 
    WHERE `page_id` = pa.`page_id`
    AND `date_id` IN 
        (
            SELECT `date_id` 
            FROM `rh_pages_gsc_dates` 
            WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
        )
) as keywords,
(
    SELECT 
    AVG(`position`) 
    FROM `rh_pages_gsc_keywords` 
    WHERE `page_id` = pa.`page_id`
    AND `date_id` IN 
        (
            SELECT `date_id` 
            FROM `rh_pages_gsc_dates` 
            WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
        )
) as avg_pos,
(
    SELECT 
    AVG(`ctr`) 
    FROM `rh_pages_gsc_keywords` 
    WHERE `page_id` = pa.`page_id`
    AND `date_id` IN 
        (
            SELECT `date_id` 
            FROM `rh_pages_gsc_dates` 
            WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
        )
) as avg_ctr
FROM `rh_pages` pa
WHERE pa.`site_id` = 13
ORDER BY au_impressions DESC, keywords DESC, slug DESC

If anyone can help, I don’t think the table structure is needed here as it’s basically shown in the query, but here is a photo of the constraints and table types.

SQL Tables

Anyone that can help is greatly appreciated.

2

Answers


  1. Your query is aggregating (summarizing) rows from two different detail tables, rh_pages_gsc_country and rh_pages_gsc_keywords, and doing so for a particular date range. And it has a lot of correlated subqueries.

    The first steps in your path to better performance are

    • Converting your correlated subqueries to independent subqueries, then JOINing them.
    • Writing one subquery for each detail table, rather than one for each column you need summarized.

    You mentioned you’ve been struggling with this. The concept I hope you learn from this answer is this: you can often refactor away your correlated subqueries if you can come up with independent subqueries that give the same results, and then join them together. If you mention subqueries in your SELECT clause — SELECT ... (SELECT whatever) whatever ... — you probably have an opportunity to do this refactoring.

    Here goes. First you need a subquery for your date range. You have this one right, just repeated.

               SELECT `date_id` 
                FROM `rh_pages_gsc_dates` 
                WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
    

    Next you need a subquery for rh_pages_gsc_country. It is a modification of what you have. We’ll fetch both SUMs in one subquery.

          SELECT SUM(`impressions`) impressions,
                 SUM(`clicks`) clicks, 
                 page_id, date_id
            FROM `rh_pages_gsc_country` 
           WHERE `country` = 'aus'
          GROUP BY page_id, date_id
    

    See how this goes? This subquery yields a virtual table with exactly one row for every combination of page_id and date_id, containing the number of impressions and the number of clicks.

    Next, let’s join the subqueries together in a main query. This yields some columns of your result set.

    SELECT pa.slug, country.impressions, country.clicks
      FROM rh_pages pa 
      JOIN (
               SELECT SUM(`impressions`) impressions,
                      SUM(`clicks`) clicks, 
                      page_id, date_id
                 FROM `rh_pages_gsc_country` 
                WHERE `country` = 'aus'  -- constant for country code
                GROUP BY page_id, date_id
          ) country ON  country.page_id = pa.page_id
      JOIN (
               SELECT `date_id` 
                FROM `rh_pages_gsc_dates` 
                WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
           ) dates ON dates.date_id = country.date_id
     WHERE pa.site_id = 13             -- constant for page id
     ORDER BY country.impressions DESC
    

    This runs through the rows of rh_pages_gsc_dates and rh_pages_gsc_country just once to satisfy your query. So, faster.

    Finally let’s do the same thing for your rh_pages_gsc_keywords table’s summary.

    SELECT pa.slug, country.impressions, country.clicks,
           keywords.keywords, keywords.avg_pos, keywords.avg_ctr
      FROM rh_pages pa 
      JOIN (
               SELECT SUM(`impressions`) impressions,
                      SUM(`clicks`) clicks, 
                      page_id, date_id
                 FROM `rh_pages_gsc_country` 
                WHERE `country` = 'aus'  -- constant for country code
                GROUP BY page_id, date_id
          ) country ON  country.page_id = pa.page_id
      JOIN (
               SELECT SUM(`keywords_id`) keywords,
                      AVG(`position`) position,
                      AVG(`ctr`) avg_ctr, 
                      page_id, date_id
                 FROM `rh_pages_gsc_keywords` 
                GROUP BY page_id, date_id
           ) keywords ON keywords.page_id = pa.page_id
      JOIN (
               SELECT `date_id` 
                FROM `rh_pages_gsc_keywords` 
                WHERE `date` BETWEEN NOW() - INTERVAL 12 MONTH AND NOW()
           ) dates ON dates.date_id = country.date_id
                  AND dates.date_id = keywords.date_id
     WHERE pa.site_id = 13             -- constant for page id
     ORDER BY impressions DESC, keywords DESC, slug DESC
    

    This will almost certainly be faster than what you have now. If it’s fast enough, great. If not, please don’t hesitate to ask another question for help, tagging it . We will need to see your table definitions, your index definitions, and the output of EXPLAIN. Please read this before asking a followup question.

    I did not, repeat not, debug any of this. That’s up to you.

    Login or Signup to reply.
  2. Do NOT normalize any column that will be regularly used in a "range scan", such as date. The following is terribly slow:

      AND  `date_id` IN (
                    SELECT  `date_id`
                        FROM  `rh_pages_gsc_dates`
                        WHERE  `date` BETWEEN NOW() - INTERVAL 12 MONTH
                                          AND NOW() ) 
    

    It also consumes extra space to have BIGINT (8 bytes) pointing to a DATE (5 bytes).

    Once you move the date to the various tables, the subqueries simplify, such as

            SELECT  AVG(`position`)
                FROM  `rh_pages_gsc_keywords`
                WHERE  `page_id` = pa.`page_id`
                  AND  `date_id` IN (
                    SELECT  `date_id`
                        FROM  `rh_pages_gsc_dates`
                        WHERE  `date` BETWEEN NOW() - INTERVAL 12 MONTH
                                          AND NOW() ) 
    

    becomes

            SELECT  AVG(`position`)
                FROM  `rh_pages_gsc_keywords`
                WHERE  `page_id` = pa.`page_id`
                  AND  `date` >= NOW() - INTERVAL 12 MONTH
    

    I’m assuming that nothing after "NOW" has yet been stored.
    If there are dates in the future, then add

                  AND  `date`  < NOW()
    

    Each table will probably need a new index, such as

    INDEX(page_id, date)  -- in that order
    

    (Yes, the "JOIN" suggestion by others is a good one. It’s essentially orthogonal to my suggestions above and below.)

    After you have made those changes, if the performance is not good enough, we can discuss Summary Tables

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search