I have to retrieve a list of postings, by one or more category IDs.
I do not want to have duplicate postings in my results.
I’m only interested in responses that directly relate, or can draw relations with MySQL 8
There are two queries I’m considering, and I’m deciding which one is better. Or, if there is a better "3rd query", please advise.
Consider a simple two table structure:
CREATE TABLE `job_category_posting` (
`category_posting_id` int UNSIGNED NOT NULL,
`category_posting_category_id` int UNSIGNED NOT NULL,
`category_posting_posting_id` int UNSIGNED NOT NULL,
`category_posting_is_primary_category` tinyint UNSIGNED DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
CREATE TABLE `job_posting` (
`posting_id` int UNSIGNED NOT NULL,
`posting_title` varchar(250) NOT NULL,
`posting_body` mediumtext CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
ALTER TABLE `job_category_posting`
ADD PRIMARY KEY (`category_posting_id`),
ADD UNIQUE KEY `category_posting_category_id` (`category_posting_category_id`,`category_posting_posting_id`),
ADD UNIQUE KEY `category_posting_is_primary_category` (`category_posting_is_primary_category`,`category_posting_posting_id`),
ADD KEY `category_posting_posting_id` (`category_posting_posting_id`) USING BTREE;
ALTER TABLE `job_posting`
ADD PRIMARY KEY (`posting_id`),
ADD UNIQUE KEY `posting_reserve_id` (`posting_reserve_id`),
ADD KEY `posting_title` (`posting_title`);
1st query (SUBQUERY with GROUP BY):
SELECT t1.*
FROM job_posting AS t1
WHERE (t1.posting_id) IN(
SELECT category_posting_posting_id
FROM job_category_posting
WHERE category_posting_category_id IN (2,13,22,23,24,25)
GROUP BY category_posting_posting_id
)
Quick dirty speed tests (doesn’t tell me much):
- 0.0017 seconds
- 0.0016 seconds
- 0.0011 seconds
- 0.0017 seconds
What I noticed:
- Query plan has gone through quite a number of rows (2356 + 1 + 1935) to get the result
- No temporary table. Uses just index.
2nd query (INNER JOIN with GROUP BY):
SELECT job_posting.*
FROM job_category_posting
inner join job_posting on job_category_posting.category_posting_posting_id = job_posting.posting_id
WHERE category_posting_category_id IN (2,13,22,23,24,25)
GROUP BY category_posting_posting_id
Quick dirty speed tests (doesn’t tell me much):
- 0.0016 seconds
- 0.0011 seconds
- 0.0010 seconds
- 0.0019 seconds
What I noticed:
- Query plan has gone through only 1935 + 1 rows
- But it uses temporary table
So my question is, which is better? Is there a better explanation that can justify it? I just need some solid facts and proof.
Or is there a third query I should try?
Any advice is appreciated!
2
Answers
A few things:
You have appropriate indexes for both your queries.
Execution plans often change as tables grow. The work you do to save a few hundred microseconds on a small table is not necessarily useful for a larger table. You’ll probably need to revisit execution plans as your tables grow.
"Using temporary" doesn’t mean your query uses a full-blown on-disk temporary table. It simply means the software accumulates a result set into a temporary data structure prior to, in your case, deduplicating it. (Only if that temporary data structure is too big for RAM does the software use on-disk structures. Yours certainly fits in RAM.) Don’t be fooled by the inaccurate, but time-honored, language in the execution plan’s
extra
column. "Using temporary" is OK.The
value IN (set of values)
predicate automatically deduplicates the set of values. So your first query can be rewritten without the GROUP BY thusly.This is the query I would use, because (in my opinion) it expresses your intent most clearly. And, I suspect it scales up better to large tables because it does the deduplication work on the set of
posting_id
values alone, not whole rows.Your second query misuses MySQL’s notorious nonstandard extension to GROUP BY. Disable that extension with
SET sql_mode = CONCAT_WS(',',@@sql_mode, 'ONLY_FULL_GROUP_BY')
, then try your query again. You’ll need more terms in your GROUP BY clause. Better yet, get rid of the GROUP BY and use DISTINCT, like this.But this has to deduplicate whole rows.
job_category_posting
seems to be a many-to-many mapping table. If so, the indexes you have are sub-optimal. See http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_tableSee this for a way to more precisely gauge which is faster (by counting rows touched): mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts
Avoid using
IN ( SELECT ... )
; it is usually sub-optimal.JOIN
orEXISTS
is usually better.Since the
GROUP BY
seems to apply to one table, try this formulation:That way, the subquery shrinks the number of rows to look at before reaching into the other table.