
I have a users table with a bio field and an n:n relationship with itself via a followers table, so each user U can follow many other users. My user search query is very slow.

  1. All queries get the first 20 search results (limit 20).
  2. Searching for users who have "founder" in their bio takes 0.3s.
  3. Searching for users who follow X takes 0.03s.
  4. Searching for users who have "founder" in their bio AND follow X takes 118s.

Query for both filters:

select distinct `twitter_user`.`id`
from `twitter_user`
         join `twitter_user_follower`
              on (
                          `twitter_user_follower`.`follower_twitter_user_id` =
                          `twitter_user`.`id`
                      and `twitter_user_follower`.`twitter_user_id` = 4899565692
                      and `twitter_user_follower`.`follower_download_id` = 7064
                  )
where MATCH(twitter_user.description) AGAINST('founder')
limit 20 offset 0

Table definitions:

CREATE TABLE `twitter_user` (
  `id` bigint NOT NULL,
  `name` varchar(128) NOT NULL,
  `email` varchar(128) DEFAULT NULL,
  `screen_name` varchar(128) DEFAULT NULL,
  `location` varchar(256) DEFAULT NULL,
  `description` varchar(512) DEFAULT NULL,
  `url` varchar(256) DEFAULT NULL,
  `is_protected` bit(1) DEFAULT NULL,
  `followers_count` int DEFAULT NULL,
  `is_verified` bit(1) DEFAULT NULL,
  `friends_count` int DEFAULT NULL,
  `created_at` bigint DEFAULT NULL,
  `favourites_count` int DEFAULT NULL,
  `utc_offset` int DEFAULT NULL,
  `time_zone` varchar(128) DEFAULT NULL,
  `statuses_count` int DEFAULT NULL,
  `profile_image_url` varchar(512) DEFAULT NULL,
  `internal_json` json DEFAULT NULL,
  `row_timestamp` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `twitter_user_username_index` (`screen_name`),
  KEY `twitter_user_ts` (`row_timestamp`),
  FULLTEXT KEY `twitter_user_description_ft_index` (`description`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

CREATE TABLE `twitter_user_follower` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `twitter_user_id` bigint NOT NULL,
  `follower_twitter_user_id` bigint NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `follower_download_id` bigint DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `twitter_user_follower_twitter_user_id_index` (`twitter_user_id`),
  KEY `twitter_user_follower_follower_download_id_index` (`follower_download_id`),
  KEY `tuf_twitter_user_follower_download_key` (`twitter_user_id`,`follower_download_id`,`follower_twitter_user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=68494675 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Explain output:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: twitter_user
   partitions: NULL
         type: fulltext
possible_keys: PRIMARY,twitter_user_username_index,twitter_user_ts,twitter_user_description_ft_index
          key: twitter_user_description_ft_index
      key_len: 0
          ref: const
         rows: 1
     filtered: 100.00
        Extra: Using where; Ft_hints: no_ranking; Using temporary
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: twitter_user_follower
   partitions: NULL
         type: ref
possible_keys: twitter_user_follower_twitter_user_id_index,twitter_user_follower_follower_download_id_index,tuf_twitter_user_follower_download_key
          key: tuf_twitter_user_follower_download_key
      key_len: 25
          ref: const,const,si_data_db.twitter_user.id
         rows: 1
     filtered: 100.00
        Extra: Using index; Distinct

Tree output:

-> Limit: 20 row(s)  (cost=4.77..4.77 rows=1)
    -> Table scan on <temporary>  (cost=2.51..2.51 rows=1)
        -> Temporary table with deduplication  (cost=4.77..4.77 rows=1)
            -> Limit table size: 20 unique row(s)
                -> Nested loop inner join  (cost=2.16 rows=1)
                    -> Filter: (match twitter_user.`description` against ('founder'))  (cost=1.06 rows=1)
                        -> Full-text index search on twitter_user using twitter_user_description_ft_index (description='founder')  (cost=1.06 rows=1)
                    -> Limit: 1 row(s)  (cost=1.10 rows=1)
                        -> Covering index lookup on twitter_user_follower using tuf_twitter_user_follower_download_key (twitter_user_id=4899565692, follower_download_id=7064, follower_twitter_user_id=twitter_user.id)  (cost=1.10 rows=1)

This query is still slow:

SELECT `follower`.`follower_twitter_user_id`
FROM (
         SELECT `follower_twitter_user_id`
         FROM `twitter_user_follower`
         WHERE `twitter_user_id` = 4899565692
           AND `follower_download_id` = 7440
     ) AS follower
         JOIN `twitter_user` ON `follower`.`follower_twitter_user_id` =  `twitter_user`.`id`
WHERE MATCH(twitter_user.description) AGAINST('+founder' IN BOOLEAN MODE)
limit 20 offset 0;

Explain output:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: twitter_user
   partitions: NULL
         type: fulltext
possible_keys: PRIMARY,twitter_user_description_ft_index
          key: twitter_user_description_ft_index
      key_len: 0
          ref: const
         rows: 1
     filtered: 100.00
        Extra: Using where; Ft_hints: no_ranking
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: twitter_user_follower
   partitions: NULL
         type: ref
possible_keys: twitter_user_follower_twitter_user_id_index,twitter_user_follower_follower_download_id_index,tuf_twitter_user_follower_download_key
          key: tuf_twitter_user_follower_download_key
      key_len: 25
          ref: const,const,si_data_db.twitter_user.id
         rows: 1
     filtered: 100.00
        Extra: Using index

Explain Analyze output:

-> Limit: 20 row(s)  (cost=2.16 rows=1) (actual time=3779.933..91032.297 rows=20 loops=1)
    -> Nested loop inner join  (cost=2.16 rows=1) (actual time=3779.932..91032.285 rows=20 loops=1)
        -> Filter: (match twitter_user.`description` against ('+founder' in boolean mode))  (cost=1.06 rows=1) (actual time=94.166..90001.280 rows=198818 loops=1)
            -> Full-text index search on twitter_user using twitter_user_description_ft_index (description='+founder')  (cost=1.06 rows=1) (actual time=94.163..89909.371 rows=198818 loops=1)
        -> Covering index lookup on twitter_user_follower using tuf_twitter_user_follower_download_key (twitter_user_id=4899565692, follower_download_id=7440, follower_twitter_user_id=twitter_user.id)  (cost=1.10 rows=1) (actual time=0.005..0.005 rows=0 loops=198818)

The users table is 125 GB and the followers table is 5 GB on disk. If I convert the query into two selects joined together, it runs in 45s:

select t1.id from
(select follower_twitter_user_id as id from `twitter_user_follower`
 where (
                   `twitter_user_follower`.`twitter_user_id` = 4899565692
               and `twitter_user_follower`.`follower_download_id` = 8039
           )) t1
inner join
(
    select `twitter_user`.`id`
    from `twitter_user` where MATCH(twitter_user.description) AGAINST('+create' IN BOOLEAN MODE)
) t2 on t1.id = t2.id
limit 20 offset 0

Explain output (tree format):

 -> Limit: 20 row(s)  (cost=2.18 rows=1)
    -> Nested loop inner join  (cost=2.18 rows=1)
        -> Filter: (match twitter_user.`description` against ('+create' in boolean mode))  (cost=1.08 rows=1)
            -> Full-text index search on twitter_user using twitter_user_description_ft_index (description='+create')  (cost=1.08 rows=1)
        -> Covering index lookup on twitter_user_follower using tuf_twitter_user_follower_download_key (twitter_user_id=4899565692, follower_download_id=8039, follower_twitter_user_id=twitter_user.id)  (cost=1.10 rows=1)

Why does it take 45s to run?

3 Answers


  1. Try the following. Change

    MATCH(twitter_user.description) AGAINST('founder')
    

    to

    MATCH(twitter_user.description) AGAINST('+founder' IN BOOLEAN MODE)
    

    Also, the DISTINCT may not be necessary.

    IODKU?

    Buried in a comment I see a DELETE + INSERT that makes for a lot of churn in part of the table.

    • InnoDB’s FULLTEXT may not be efficient in such cases
    • If most of the rows aren’t changing, then the delete+insert is inefficient and causes more churn than may be necessary.

    Look at INSERT ... ON DUPLICATE KEY UPDATE ... to see if it can be used instead of the delete+insert. If most of the rows don’t change, then this will probably be faster and have less impact on things such as the FULLTEXT index.
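
    A minimal sketch of the IODKU, assuming the hourly feed is first loaded into a hypothetical staging table `follower_staging` and that a UNIQUE key over (`twitter_user_id`, `follower_twitter_user_id`, `follower_download_id`) is added to `twitter_user_follower` (the posted schema only has a non-unique key on those columns):

    -- Hypothetical staging table holding the hourly feed; the UNIQUE key is
    -- what lets ON DUPLICATE KEY UPDATE detect rows that already exist.
    INSERT INTO twitter_user_follower
           (twitter_user_id, follower_twitter_user_id, follower_download_id)
    SELECT s.twitter_user_id, s.follower_twitter_user_id, s.follower_download_id
    FROM   follower_staging AS s
    ON DUPLICATE KEY UPDATE
           updated_at = CURRENT_TIMESTAMP;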

    If that Delete does actually remove some rows, then the IODKU (upsert) won’t suffice. A second pass using something like INSERT ... SELECT ... LEFT JOIN may be the solution for inserting "new" rows. (I allude to such in a different context here: Normalization; see SQL#1.)
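
    For illustration, a sketch of that second pass under the same staging-table assumption; it inserts only the feed rows that are not already present:

    -- Insert rows from the hypothetical follower_staging table that have no
    -- matching row in twitter_user_follower yet.
    INSERT INTO twitter_user_follower
           (twitter_user_id, follower_twitter_user_id, follower_download_id)
    SELECT s.twitter_user_id, s.follower_twitter_user_id, s.follower_download_id
    FROM   follower_staging AS s
    LEFT JOIN twitter_user_follower AS f
           ON  f.twitter_user_id          = s.twitter_user_id
           AND f.follower_twitter_user_id = s.follower_twitter_user_id
           AND f.follower_download_id     = s.follower_download_id
    WHERE  f.id IS NULL;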

    Periodically (weekly?), run OPTIMIZE TABLE. But keep some timings to see if this step really helps.
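
    For example (on InnoDB this rebuilds the table and its indexes, so schedule it off-peak):

    OPTIMIZE TABLE `twitter_user_follower`;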

    2 steps

    First, I am still unclear about the data you receive every hour. Is it just info about one user? Does it include rows to delete with some indication that they are to be deleted instead of updated? Etc.

    If it is a single user…

    • DELETE only the rows that need deleting. This involves a multi-table DELETE with a LEFT JOIN to see what is missing (see the sketch after this list).
    • INSERT ... SELECT ... LEFT JOIN ... to either insert or update the existing rows, along the lines of the sketches above.
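
    A sketch of the delete step, again assuming the fresh follower list is loaded into the hypothetical `follower_staging` table; it removes rows for that user which no longer appear in the feed:

    -- Multi-table DELETE: rows present in twitter_user_follower but absent
    -- from the hypothetical follower_staging feed are removed.
    DELETE f
    FROM   twitter_user_follower AS f
    LEFT JOIN follower_staging AS s
           ON  s.twitter_user_id          = f.twitter_user_id
           AND s.follower_twitter_user_id = f.follower_twitter_user_id
    WHERE  f.twitter_user_id = 4899565692      -- the user whose list was refreshed
      AND  s.twitter_user_id IS NULL;
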
  2. Can you please try this and post the EXPLAIN output?

    SELECT `follower`.`follower_twitter_user_id`
    FROM (
      SELECT `follower_twitter_user_id`
      FROM `twitter_user_follower`
      WHERE `twitter_user_id` = 4899565692
        AND `follower_download_id` = 7064
    ) AS follower
    JOIN `twitter_user` ON `follower`.`follower_twitter_user_id` =  `twitter_user`.`id`
    WHERE MATCH(twitter_user.description) AGAINST('+founder' IN BOOLEAN MODE)
    limit 20 offset 0;
    
  3. An option to try, in an attempt to minimise overheads and maximise any short-circuiting:

    SELECT
       `twitter_user`.`id`
    FROM
      `twitter_user`
    WHERE
      MATCH(twitter_user.description) AGAINST ('+founder' IN BOOLEAN MODE)
      AND
      EXISTS (
        SELECT
          *
        FROM
          `twitter_user_follower`
        WHERE
          `follower_twitter_user_id` = `twitter_user`.`id`
          AND `twitter_user_id` = 4899565692
          AND `follower_download_id` = 7064
      )
    LIMIT 20
    OFFSET 0 
    