Mysql - How do you query a main table with known duplicates that also have tags?

Surgemus
September 18, 2023
204 views
1 vote
2 Answers

I have a main table called pics that has an ID number and the origin site from which it was pulled from. I also have another table called tags that has a list of all the tags that were also pulled from that site. Here’s the kicker, there are known duplicates. The way in which duplicates are known is irrelevant. The duplicates are also stored in their own table. The first column, "original", is the first instance of it in the "pics" table. The second column is any additional duplicate. The website searches images based on tags, which can be more than one tag.
(I know having the source site in the pics table may not be optimal, nor proper database design.) This must work on MySQL/phpMyAdmin.

1. What is the SQL join query that will list all of the image ID numbers when searching for a specific tag? For example:

Search String: fire, Result: [null]
Search String: red, Result: #1
Search String: vines, Result: #2
Search String: green, Result: #2
Search String: forest brown, Result: #2
Search String: bird tree, Result: [null]

2. What is the SQL join query that will merge the image source sites? For example:

Site(s) for image: 1, Result: colors.com
Site(s) for image: 2, Result: colors.com nature.com
Site(s) for image: 4, Result: nature.com
(Getting the sites for #5 and #6 doesn’t matter because they would always reference #2 anyway.)

3. OR would it honestly just be easier to not do this complicated join stuff and do simpler join queries and have an automated, scheduled procedure manually merge tags and delete duplicate entries?

Here is the SQL to copy paste into a blank database:

START TRANSACTION;
CREATE TABLE `duplicates` (`original` int(2) NOT NULL,`additional` int(2) NOT NULL);
INSERT INTO `duplicates` (`original`, `additional`) VALUES (2, 5),(2, 6);
CREATE TABLE `pics` (`id` int(2) NOT NULL,`sourceSite` text NOT NULL);
INSERT INTO `pics` (`id`, `sourceSite`) VALUES (1, 'colors.com'),(2, 'colors.com'),(3, 'colors.com'),(4, 'nature.com'),(5, 'nature.com'),(6, 'nature.com');
CREATE TABLE `tags` (`id` int(2) NOT NULL,`tag` varchar(16) NOT NULL);
INSERT INTO `tags` (`id`, `tag`) VALUES (1, 'red'),(1, 'orange'),(2, 'green'),(2, 'brown'),(3, 'blue'),(3, 'cyan'),(4, 'bird'),(4, 'flight'),(5, 'tree'),(5, 'forest'),(6, 'vines'),(6, 'tree');
ALTER TABLE `duplicates` ADD KEY `original` (`original`), ADD KEY `additional` (`additional`);
ALTER TABLE `pics` ADD PRIMARY KEY (`id`);
ALTER TABLE `tags` ADD KEY `id` (`id`);
ALTER TABLE `duplicates` ADD CONSTRAINT `duplicates_ibfk_1` FOREIGN KEY (`original`) REFERENCES `pics` (`id`) ON DELETE CASCADE ON UPDATE CASCADE, ADD CONSTRAINT `duplicates_ibfk_2` FOREIGN KEY (`additional`) REFERENCES `pics` (`id`) ON DELETE CASCADE ON UPDATE CASCADE;
ALTER TABLE `tags` ADD CONSTRAINT `tags_ibfk_1` FOREIGN KEY (`id`) REFERENCES `pics` (`id`) ON DELETE CASCADE ON UPDATE CASCADE;
COMMIT;

Answers

- Christian
- September 18, 2023 at 11:25 am
- 0 votes
0
Here’s a graphic representation of your schema. Let’s work with this.
1. SQL query to list images under a given tag:
```
SELECT pics.*
FROM tags
    LEFT JOIN duplicates ON tags.id = duplicates.original OR tags.id = duplicates.additional
    LEFT JOIN pics on duplicates.original = pics.id
WHERE tags.tag = "green"
GROUP BY duplicates.original;
```
  That is, given an input tag, e.g. green:
  1. query for tags whose tags.tag equal "green",
  2. pull up any duplicates that are related by tags.id,
  3. "squash" these duplicates into one by grouping them by duplicates.original, so that only one duplicate "prevails", then
  4. get the corresponding pics.* based on the result above.
  In hindsight, you might want to consider creating foreign key constraints between tables tags and duplicates.
2. SQL query to list source sites of images:
```
SELECT pics.sourceSite
FROM duplicates
    LEFT JOIN pics ON duplicates.additional = pics.id
WHERE duplicates.original = 2 OR duplicates.additional = 2
GROUP BY pics.sourceSite
```
  That is, given an input image ID, e.g. 2:
  1. look up for duplicates that references the image ID 2,
  2. query pics associated to any duplicates in the result above, then
  3. group by and select the source sites of these pictures.
3. Ideally, you want to prune duplicates and normalize tags. But in reality, there might be valid factors for not doing so. In short, it depends.™
Login or Signup to reply.

Doing this with your current structure is very inefficient but can be done.

Searching the pic IDs based on a set of tags:

SELECT COALESCE(d.original, t.id) AS id
FROM tags t
LEFT JOIN duplicates d ON d.additional = t.id
WHERE t.tag IN ('forest', 'brown', 'vines')
GROUP BY 1
HAVING COUNT(DISTINCT t.tag) = 3 -- because we have 3 tags to find

Searching for sources of a given pic ID:

SELECT p.sourceSite
FROM pics p
WHERE p.id = 2
UNION
SELECT p.sourceSite
FROM pics p
JOIN duplicates d ON d.additional = p.id
WHERE d.original = 2;

If you want to tie both of the above together so you can search for pics based on a set of tags and return the id, sources and full list of tags you can do the following.

First step: get the ids of the images with all of the required tags:

WITH pics_by_tags (pic_id) AS (
    SELECT COALESCE(d.original, t.id)
    FROM tags t
    LEFT JOIN duplicates d ON d.additional = t.id
    WHERE t.tag IN ('forest', 'brown', 'vines')
    GROUP BY 1
    HAVING COUNT(DISTINCT t.tag) = 3 -- because we have 3 tags to find
)
SELECT * FROM pics_by_tags;

Output:

pic_id
2

Second step: get sources for each of the pics returned in previous step:

WITH pics_by_tags (pic_id) AS (
    /* as before */
),
pics_with_sites (pic_id, sites) AS (
    SELECT pbt.pic_id, GROUP_CONCAT(DISTINCT p.sourceSite)
    FROM pics_by_tags pbt
    JOIN pics p
        ON pbt.pic_id = p.id
        OR EXISTS (SELECT 1 FROM duplicates d WHERE pbt.pic_id = d.original AND p.ID = d.additional)
    GROUP BY 1
)
SELECT * FROM pics_with_sites;

Output:

pic_id	sites
2	colors.com,nature.com

Third step: get full list of tags for result of previous step:

WITH pics_by_tags (pic_id) AS (
    /* as before */
),
pics_with_sites (pic_id, sites) AS (
    /* as before */
)
SELECT p.pic_id, p.sites, GROUP_CONCAT(DISTINCT t.tag ORDER BY t.tag) AS tags
FROM pics_with_sites p
JOIN tags t
    ON p.pic_id = t.id
    OR EXISTS (SELECT 1 FROM duplicates d WHERE p.pic_id = d.original AND t.id = d.additional)
GROUP BY p.pic_id, p.sites;

Output:

pic_id	sites	tags
2	colors.com,nature.com	brown,forest,green,tree,vines

The full query is:

WITH pics_by_tags (pic_id) AS (
    SELECT COALESCE(d.original, t.id)
    FROM tags t
    LEFT JOIN duplicates d ON d.additional = t.id
    WHERE t.tag IN ('forest', 'brown', 'vines')
    GROUP BY 1
    HAVING COUNT(DISTINCT t.tag) = 3 -- because we have 3 tags to find
),
pics_with_sites (pic_id, sites) AS (
    SELECT pbt.pic_id, GROUP_CONCAT(DISTINCT p.sourceSite)
    FROM pics_by_tags pbt
    JOIN pics p
        ON pbt.pic_id = p.ID
        OR EXISTS (SELECT 1 FROM duplicates d WHERE pbt.pic_id = d.original AND p.id = d.additional)
    GROUP BY 1
)
SELECT p.pic_id, p.sites, GROUP_CONCAT(DISTINCT t.tag ORDER BY t.tag) AS tags
FROM pics_with_sites p
JOIN tags t
    ON p.pic_id = t.id
    OR EXISTS (SELECT 1 FROM duplicates d WHERE p.pic_id = d.original AND t.id = d.additional)
GROUP BY p.pic_id, p.sites;

Obviously, this task would be much better served by a normalized structure, without the duplication.

Something like:

pictures (id, other_picture_meta_data)
sources (id, name)
tags_new (id, name)
pics_sources (pic_id, source_id)
pics_tags (pic_id, tag_id)

With the revised structure the query becomes:

WITH pics_by_tags (pic_id) AS (
    SELECT pt.pic_id
    FROM tags_new tn
    JOIN pics_tags pt ON tn.id = pt.tag_id
    WHERE tn.name IN ('forest', 'brown', 'vines')
    GROUP BY 1
    HAVING COUNT(tn.name) = 3 -- because we have 3 tags to find
),
pics_with_sites (pic_id, sites) AS (
    SELECT pbt.pic_id, GROUP_CONCAT(s.name)
    FROM pics_by_tags pbt
    JOIN pics_sources ps ON pbt.pic_id = ps.pic_id
    JOIN sources s ON ps.source_id = s.id
    GROUP BY 1
)
SELECT p.*, pws.sites, GROUP_CONCAT(tn.name ORDER BY tn.name) AS tags
FROM pics_with_sites pws
JOIN pictures p ON pws.pic_id = p.id
JOIN pics_tags pt ON pws.pic_id = pt.pic_id
JOIN tags_new tn ON pt.tag_id = tn.id
GROUP BY p.id;

Here’s a db<>fiddle to play with.

Please signup or login to give your own answer.

Click here to cancel reply.

Mysql – How do you query a main table with known duplicates that also have tags?

Answers