skip to Main Content

I have to retrieve a list of postings, by one or more category IDs.
I do not want to have duplicate postings in my results.

I’m only interested in responses that directly relate, or can draw relations with MySQL 8

There are two queries I’m considering, and I’m deciding which one is better. Or, if there is a better "3rd query", please advise.

Consider a simple two table structure:

CREATE TABLE `job_category_posting` (
  `category_posting_id` int UNSIGNED NOT NULL,
  `category_posting_category_id` int UNSIGNED NOT NULL,
  `category_posting_posting_id` int UNSIGNED NOT NULL,
  `category_posting_is_primary_category` tinyint UNSIGNED DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

CREATE TABLE `job_posting` (
  `posting_id` int UNSIGNED NOT NULL,
  `posting_title` varchar(250) NOT NULL,
  `posting_body` mediumtext CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

ALTER TABLE `job_category_posting`
  ADD PRIMARY KEY (`category_posting_id`),
  ADD UNIQUE KEY `category_posting_category_id` (`category_posting_category_id`,`category_posting_posting_id`),
  ADD UNIQUE KEY `category_posting_is_primary_category` (`category_posting_is_primary_category`,`category_posting_posting_id`),
  ADD KEY `category_posting_posting_id` (`category_posting_posting_id`) USING BTREE;

ALTER TABLE `job_posting`
  ADD PRIMARY KEY (`posting_id`),
  ADD UNIQUE KEY `posting_reserve_id` (`posting_reserve_id`),
  ADD KEY `posting_title` (`posting_title`);

1st query (SUBQUERY with GROUP BY):

SELECT t1.*
FROM job_posting AS t1
WHERE (t1.posting_id) IN(
   SELECT category_posting_posting_id
   FROM job_category_posting
   WHERE category_posting_category_id IN (2,13,22,23,24,25)
   GROUP BY category_posting_posting_id
)

Quick dirty speed tests (doesn’t tell me much):

  • 0.0017 seconds
  • 0.0016 seconds
  • 0.0011 seconds
  • 0.0017 seconds

EXPLAIN gives me this:
1st query using subquery

What I noticed:

  • Query plan has gone through quite a number of rows (2356 + 1 + 1935) to get the result
  • No temporary table. Uses just index.

2nd query (INNER JOIN with GROUP BY):

 SELECT job_posting.*
 FROM job_category_posting
 inner join job_posting on job_category_posting.category_posting_posting_id = job_posting.posting_id
 WHERE category_posting_category_id IN (2,13,22,23,24,25)
GROUP BY category_posting_posting_id

Quick dirty speed tests (doesn’t tell me much):

  • 0.0016 seconds
  • 0.0011 seconds
  • 0.0010 seconds
  • 0.0019 seconds

EXPLAIN gives me this:
2nd query using join

What I noticed:

  • Query plan has gone through only 1935 + 1 rows
  • But it uses temporary table

So my question is, which is better? Is there a better explanation that can justify it? I just need some solid facts and proof.

Or is there a third query I should try?

Any advice is appreciated!

2

Answers


  1. A few things:

    1. You have appropriate indexes for both your queries.

    2. Execution plans often change as tables grow. The work you do to save a few hundred microseconds on a small table is not necessarily useful for a larger table. You’ll probably need to revisit execution plans as your tables grow.

    3. "Using temporary" doesn’t mean your query uses a full-blown on-disk temporary table. It simply means the software accumulates a result set into a temporary data structure prior to, in your case, deduplicating it. (Only if that temporary data structure is too big for RAM does the software use on-disk structures. Yours certainly fits in RAM.) Don’t be fooled by the inaccurate, but time-honored, language in the execution plan’s extra column. "Using temporary" is OK.

    4. The value IN (set of values) predicate automatically deduplicates the set of values. So your first query can be rewritten without the GROUP BY thusly.

      SELECT t1.*
        FROM job_posting AS t1
       WHERE t1.posting_id IN (
              SELECT category_posting_posting_id
                FROM job_category_posting
               WHERE category_posting_category_id IN (2,13,22,23,24,25)
             )
      

      This is the query I would use, because (in my opinion) it expresses your intent most clearly. And, I suspect it scales up better to large tables because it does the deduplication work on the set of posting_id values alone, not whole rows.

    5. Your second query misuses MySQL’s notorious nonstandard extension to GROUP BY. Disable that extension with SET sql_mode = CONCAT_WS(',',@@sql_mode, 'ONLY_FULL_GROUP_BY'), then try your query again. You’ll need more terms in your GROUP BY clause. Better yet, get rid of the GROUP BY and use DISTINCT, like this.

      SELECT DISTINCT job_posting.*
        FROM job_category_posting
       inner join job_posting
           on job_category_posting.category_posting_posting_id = job_posting.posting_id
       WHERE category_posting_category_id IN (2,13,22,23,24,25)
      

      But this has to deduplicate whole rows.

    Login or Signup to reply.
    • job_category_posting seems to be a many-to-many mapping table. If so, the indexes you have are sub-optimal. See http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table

    • See this for a way to more precisely gauge which is faster (by counting rows touched): mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts

    • Avoid using IN ( SELECT ... ); it is usually sub-optimal. JOIN or EXISTS is usually better.

    • Since the GROUP BY seems to apply to one table, try this formulation:

        SELECT ...
            FROM ( SELECT ... GROUP BY ... ) AS a
            JOIN ... AS b  ON ...
      

      That way, the subquery shrinks the number of rows to look at before reaching into the other table.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search