skip to Main Content

I have a large MySQL table and when it had a low number of records the searches were fast, however the table now has over 400,000 records and searches take less than a second between 0.60 and 0.75. I’ve tried using indexes to bring this down to a nearer the 0.10 second or at least lower than 0.60 with no success. This estimate was from using microtime in PHP immediately before and after the SQL query.

This is a snippet of the table structure output from SHOW CREATE TABLE, there are other fields in the table but these aren’t used in the SQL search statement.

CREATE TABLE `mytable` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` longtext COLLATE utf8_unicode_ci NOT NULL,
  `url_sha512hash` char(128) COLLATE utf8_unicode_ci NOT NULL DEFAULT 'no hash.',
  `viewdate` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uniq-web_url` (`url_sha512hash`),
  KEY `idx-viewdate` (`viewdate`)
) ENGINE=MyISAM AUTO_INCREMENT=404899 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

This is the SQL statement which takes between 60 and 75 seconds that I would like to speed up.

SELECT `id`, `url` FROM `mytable` USE INDEX(`idx-viewdate`)
  WHERE (`url` LIKE "https://www.domain.tld/path/to/dir"
   AND `viewdate` > "2022-11-20 23:23:00")
   OR (`url` LIKE "https://www.domain.tld/path/to/dir/" AND `viewdate` > "INSERT SAME VIEW DATE AS BEFORE")
  ORDER BY `id` DESC;

Output from EXPLAIN

+------+-------------+---------+-------+---------------+--------------+---------+------+------+----------------------------------------------------+
| id   | select_type | table   | type  | possible_keys | key          | key_len | ref  | rows | Extra                                              |
+------+-------------+---------+-------+---------------+--------------+---------+------+------+----------------------------------------------------+
|    1 | SIMPLE      | mytable | range | idx-viewdate  | idx-viewdate | 4       | NULL |   28 | Using index condition; Using where; Using filesort |
+------+-------------+---------+-------+---------------+--------------+---------+------+------+----------------------------------------------------+

3

Answers


  1. Create a single index on url, viewdate, id.

    • This will minimise the rows you Must search to find the matches.

    As per comments, use InnoDB, not MyISAM.

    Then use this WHERE clause…

    WHERE
      viewdate > ?
      AND url IN (?, CONCAT(?,'/'))
    

    LIKE is expensive, and unnecessary if you’re looking for exact matches.

    IN() is still more expensive than =, but this form makes it clear you’re looking for one or two exact matches.

    Finally, be VERY certain to use parameterises queries / prepared statements. Do NOT manually substitute strings in to your query using string manipulation in PHP.

    EDIT

    Or, put the hash column in the index, and hash the two forms of the URL parameters when searching.

    ALTER TABLE `mysql` ADD INDEX `idx-visited2` (`url_sha512hash`, `viewdate`, `id`)
    
    WHERE
      viewdate > DATE
      AND url_sha512hash IN (HASHED_URL, HASHED_URL_WITH_SLASH)
    
    Login or Signup to reply.
  2. What version of MySQL are you running? Do you have control of the server or is it shared hosting?

    The output from the following queries would be helpful for context –

    SELECT VERSION();
    ANALYZE TABLE `mytable`;
    SHOW INDEX FROM `mytable`;
    SHOW TABLE STATUS LIKE 'mytable';
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
    

    The output of the following query will help with understanding the distribution your url data (may be slow so be cautious if server under load) –

    SELECT
        COUNT(*) countAll,
        COUNT(IF(LENGTH(url) <= 32, 1, NULL)) count32,
        COUNT(DISTINCT LEFT(url, 32)) distinct32,
        COUNT(IF(LENGTH(url) <= 48, 1, NULL)) count48,
        COUNT(DISTINCT LEFT(url, 48)) distinct48,
        COUNT(IF(LENGTH(url) <= 64, 1, NULL)) count64,
        COUNT(DISTINCT LEFT(url, 64)) distinct64,
        COUNT(IF(LENGTH(url) <= 80, 1, NULL)) count80,
        COUNT(DISTINCT LEFT(url, 80)) distinct80,
        COUNT(IF(LENGTH(url) <= 96, 1, NULL)) count96,
        COUNT(DISTINCT LEFT(url, 96)) distinct96
    FROM mytable;
    

    As suggested by @MatBailie, you should add a composite index on url and viewdate, but do not include id.

    When adding an index on TEXT/BLOB columns you must specify the key length (number of chars to be indexed). A longtext column can hold values up to 4GB, which would not be good for an index.

    You also need to change the data type for your url column to a varchar. The current maximum length is 237 so bring it down to whatever you are happy with. 256 would give you some headroom but you may be more comfortable with 300.

    ALTER TABLE `mytable` 
        CHANGE COLUMN `url` `url` VARCHAR(300) NOT NULL,
        ADD INDEX `idx_url_viewed` (`url`(64), `viewdate`);
    

    Please provide updated EXPLAIN output for your query, after applying the changes.

    UPDATE

    You are currently storing your hash in a multi-byte column which is relatively inefficient. I would suggest changing it to a binary column which will use less space and be more performant for index lookups.

    # Add the binary column
    ALTER TABLE `mytable`
        CHANGE COLUMN `url` `url` VARCHAR(300) NOT NULL,
        ADD COLUMN `url_bin_hash` BINARY(64);
    
    # Populate it
    UPDATE `mytable` SET `url_bin_hash` = UNHEX(`url_sha512hash`);
    
    # Drop the old column, change the new one to NOT NULL, and add index
    ALTER TABLE `mytable`
        DROP COLUMN `url_sha512hash`,
        CHANGE COLUMN `url_bin_hash` `url_bin_hash` BINARY(64) NOT NULL,
        ADD UNIQUE KEY `uq_url_bin_hash` (`url_bin_hash`);
    

    Then, when inserting into the new binary hash you need to remember to –

    INSERT INTO `mytable` (`url`, `url_bin_hash`, `viewdate`)
        VALUES ('url', UNHEX(SHA2('url', 512)), NOW());
    

    And to search based on the new binary hash –

    SELECT `id`, `url`
    FROM `mytable`
    WHERE `viewdate` > ?
    AND `url_bin_hash` IN (UNHEX(SHA2(?, 512)), UNHEX(SHA2(CONCAT(?, '/'), 512)))
    ORDER BY `id` DESC;
    

    Adding viewdate to the index on the binary hash column would be a waste of space as it is going to have to hit the clustered index to retrieve the url (and possibly other columns) anyway, unless you have a high miss-rate (queries returning 0 rows).

    If the id column is not being used to reference mytable from elsewhere, you could ditch the surrogate PK and use the binary hash instead, as you already have a unique index on it.

    Login or Signup to reply.
  3. Some ideas:

    • Get rid of id; promote the hash to be the PK.

    • Switch from ORDER BY id to ORDER BY viewdate

    • Change from OR to UNION

    • url REGEXP "INSERT SAME URL AS BEFORE/?" instead of OR. Note the "?".

    • I agree that you should switch to InnoDB; the suggestions here depend on that change.

    • Since REGEXP is not optimized as well as LIKE, toss the OR and use

      WHERE URL LIKE   'blah%'
        AND URL REGEXP 'blah/?'
      

    Putting them together…

    Final query. (Note id is gone)

    Plan A

    SELECT url, view_date
      FROM mytable
      WHERE view_date > '2022...'
        AND URL LIKE   'blah%'
        AND URL REGEXP 'blah/?'
      ORDER BY view_date DESC
    

    and have

    PRIMARY KEY(url_sha512hash)
    INDEX(view_date, URL(191))
    

    Plan B

    Since you are really looking at two possible URLs, not lots, then

    ( SELECT id, url, view_date
          FROM mytable
          WHERE view_date > '2022...'
            AND url_sha512hash = sha256('blah')   -- w/0 slash
    ) UNION ALL
    ) SELECT id, url, view_date
          FROM mytable
          WHERE view_date > '2022...'
            AND url_sha512hash = sha256('blah/')   -- with slash
    )
    ORDER BY view_date DESC
    

    together with

    PRIMARY KEY(url_sha512hash, view_date)
    

    Plan C

    But even better is to remove the trailing "/" when INSERTing (at least for the hash) and when SELECTing. (Or require the trailing "/" as suggested by MatBailie.) Either way, there is only one thing to test:

    SELECT id, url, view_date
          FROM mytable
          WHERE view_date > '2022...'
            AND url_sha512hash = sha256('blah')
    ORDER BY view_date DESC
    

    together with

    PRIMARY KEY(url_sha512hash, view_date)
    

    Plan C may get it down to 10ms on a spinning HDD or 1ms on SSD. A and B might be half that fast.

    If you need the id for other purposes, tack on

    INDEX(id)   -- to keep `AUTO_INCREMENT` happy.
    

    And, yes, convert the hex to BINARY(nn) with UNHEX. (I did not incorporate such.)

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search