skip to Main Content

I have a main table called pics that has an ID number and the origin site from which it was pulled from. I also have another table called tags that has a list of all the tags that were also pulled from that site. Here’s the kicker, there are known duplicates. The way in which duplicates are known is irrelevant. The duplicates are also stored in their own table. The first column, "original", is the first instance of it in the "pics" table. The second column is any additional duplicate. The website searches images based on tags, which can be more than one tag.
(I know having the source site in the pics table may not be optimal, nor proper database design.) This must work on MySQL/phpMyAdmin.

enter image description here

1. What is the SQL join query that will list all of the image ID numbers when searching for a specific tag? For example:

  • Search String: fire, Result: [null]
  • Search String: red, Result: #1
  • Search String: vines, Result: #2
  • Search String: green, Result: #2
  • Search String: forest brown, Result: #2
  • Search String: bird tree, Result: [null]

2. What is the SQL join query that will merge the image source sites? For example:

  • Site(s) for image: 1, Result: colors.com
  • Site(s) for image: 2, Result: colors.com nature.com
  • Site(s) for image: 4, Result: nature.com
  • (Getting the sites for #5 and #6 doesn’t matter because they would always reference #2 anyway.)

3. OR would it honestly just be easier to not do this complicated join stuff and do simpler join queries and have an automated, scheduled procedure manually merge tags and delete duplicate entries?

Here is the SQL to copy paste into a blank database:

START TRANSACTION;
CREATE TABLE `duplicates` (`original` int(2) NOT NULL,`additional` int(2) NOT NULL);
INSERT INTO `duplicates` (`original`, `additional`) VALUES (2, 5),(2, 6);
CREATE TABLE `pics` (`id` int(2) NOT NULL,`sourceSite` text NOT NULL);
INSERT INTO `pics` (`id`, `sourceSite`) VALUES (1, 'colors.com'),(2, 'colors.com'),(3, 'colors.com'),(4, 'nature.com'),(5, 'nature.com'),(6, 'nature.com');
CREATE TABLE `tags` (`id` int(2) NOT NULL,`tag` varchar(16) NOT NULL);
INSERT INTO `tags` (`id`, `tag`) VALUES (1, 'red'),(1, 'orange'),(2, 'green'),(2, 'brown'),(3, 'blue'),(3, 'cyan'),(4, 'bird'),(4, 'flight'),(5, 'tree'),(5, 'forest'),(6, 'vines'),(6, 'tree');
ALTER TABLE `duplicates` ADD KEY `original` (`original`), ADD KEY `additional` (`additional`);
ALTER TABLE `pics` ADD PRIMARY KEY (`id`);
ALTER TABLE `tags` ADD KEY `id` (`id`);
ALTER TABLE `duplicates` ADD CONSTRAINT `duplicates_ibfk_1` FOREIGN KEY (`original`) REFERENCES `pics` (`id`) ON DELETE CASCADE ON UPDATE CASCADE, ADD CONSTRAINT `duplicates_ibfk_2` FOREIGN KEY (`additional`) REFERENCES `pics` (`id`) ON DELETE CASCADE ON UPDATE CASCADE;
ALTER TABLE `tags` ADD CONSTRAINT `tags_ibfk_1` FOREIGN KEY (`id`) REFERENCES `pics` (`id`) ON DELETE CASCADE ON UPDATE CASCADE;
COMMIT;

2

Answers


  1. Here’s a graphic representation of your schema. Let’s work with this.

    1. SQL query to list images under a given tag:

      SELECT pics.*
      FROM tags
          LEFT JOIN duplicates ON tags.id = duplicates.original OR tags.id = duplicates.additional
          LEFT JOIN pics on duplicates.original = pics.id
      WHERE tags.tag = "green"
      GROUP BY duplicates.original;
      

      That is, given an input tag, e.g. green:

      1. query for tags whose tags.tag equal "green",
      2. pull up any duplicates that are related by tags.id,
      3. "squash" these duplicates into one by grouping them by duplicates.original, so that only one duplicate "prevails", then
      4. get the corresponding pics.* based on the result above.

      In hindsight, you might want to consider creating foreign key constraints between tables tags and duplicates.

    2. SQL query to list source sites of images:

      SELECT pics.sourceSite
      FROM duplicates
          LEFT JOIN pics ON duplicates.additional = pics.id
      WHERE duplicates.original = 2 OR duplicates.additional = 2
      GROUP BY pics.sourceSite
      

      That is, given an input image ID, e.g. 2:

      1. look up for duplicates that references the image ID 2,
      2. query pics associated to any duplicates in the result above, then
      3. group by and select the source sites of these pictures.
    3. Ideally, you want to prune duplicates and normalize tags. But in reality, there might be valid factors for not doing so. In short, it depends.™

    Login or Signup to reply.
  2. Doing this with your current structure is very inefficient but can be done.

    Searching the pic IDs based on a set of tags:

    SELECT COALESCE(d.original, t.id) AS id
    FROM tags t
    LEFT JOIN duplicates d ON d.additional = t.id
    WHERE t.tag IN ('forest', 'brown', 'vines')
    GROUP BY 1
    HAVING COUNT(DISTINCT t.tag) = 3 -- because we have 3 tags to find
    

    Searching for sources of a given pic ID:

    SELECT p.sourceSite
    FROM pics p
    WHERE p.id = 2
    UNION
    SELECT p.sourceSite
    FROM pics p
    JOIN duplicates d ON d.additional = p.id
    WHERE d.original = 2;
    

    If you want to tie both of the above together so you can search for pics based on a set of tags and return the id, sources and full list of tags you can do the following.

    First step: get the ids of the images with all of the required tags:

    WITH pics_by_tags (pic_id) AS (
        SELECT COALESCE(d.original, t.id)
        FROM tags t
        LEFT JOIN duplicates d ON d.additional = t.id
        WHERE t.tag IN ('forest', 'brown', 'vines')
        GROUP BY 1
        HAVING COUNT(DISTINCT t.tag) = 3 -- because we have 3 tags to find
    )
    SELECT * FROM pics_by_tags;
    

    Output:

    pic_id
    2

    Second step: get sources for each of the pics returned in previous step:

    WITH pics_by_tags (pic_id) AS (
        /* as before */
    ),
    pics_with_sites (pic_id, sites) AS (
        SELECT pbt.pic_id, GROUP_CONCAT(DISTINCT p.sourceSite)
        FROM pics_by_tags pbt
        JOIN pics p
            ON pbt.pic_id = p.id
            OR EXISTS (SELECT 1 FROM duplicates d WHERE pbt.pic_id = d.original AND p.ID = d.additional)
        GROUP BY 1
    )
    SELECT * FROM pics_with_sites;
    

    Output:

    pic_id sites
    2 colors.com,nature.com

    Third step: get full list of tags for result of previous step:

    WITH pics_by_tags (pic_id) AS (
        /* as before */
    ),
    pics_with_sites (pic_id, sites) AS (
        /* as before */
    )
    SELECT p.pic_id, p.sites, GROUP_CONCAT(DISTINCT t.tag ORDER BY t.tag) AS tags
    FROM pics_with_sites p
    JOIN tags t
        ON p.pic_id = t.id
        OR EXISTS (SELECT 1 FROM duplicates d WHERE p.pic_id = d.original AND t.id = d.additional)
    GROUP BY p.pic_id, p.sites;
    

    Output:

    pic_id sites tags
    2 colors.com,nature.com brown,forest,green,tree,vines

    The full query is:

    WITH pics_by_tags (pic_id) AS (
        SELECT COALESCE(d.original, t.id)
        FROM tags t
        LEFT JOIN duplicates d ON d.additional = t.id
        WHERE t.tag IN ('forest', 'brown', 'vines')
        GROUP BY 1
        HAVING COUNT(DISTINCT t.tag) = 3 -- because we have 3 tags to find
    ),
    pics_with_sites (pic_id, sites) AS (
        SELECT pbt.pic_id, GROUP_CONCAT(DISTINCT p.sourceSite)
        FROM pics_by_tags pbt
        JOIN pics p
            ON pbt.pic_id = p.ID
            OR EXISTS (SELECT 1 FROM duplicates d WHERE pbt.pic_id = d.original AND p.id = d.additional)
        GROUP BY 1
    )
    SELECT p.pic_id, p.sites, GROUP_CONCAT(DISTINCT t.tag ORDER BY t.tag) AS tags
    FROM pics_with_sites p
    JOIN tags t
        ON p.pic_id = t.id
        OR EXISTS (SELECT 1 FROM duplicates d WHERE p.pic_id = d.original AND t.id = d.additional)
    GROUP BY p.pic_id, p.sites;
    

    Obviously, this task would be much better served by a normalized structure, without the duplication.

    Something like:

    pictures (id, other_picture_meta_data)
    sources (id, name)
    tags_new (id, name)
    pics_sources (pic_id, source_id)
    pics_tags (pic_id, tag_id)
    

    With the revised structure the query becomes:

    WITH pics_by_tags (pic_id) AS (
        SELECT pt.pic_id
        FROM tags_new tn
        JOIN pics_tags pt ON tn.id = pt.tag_id
        WHERE tn.name IN ('forest', 'brown', 'vines')
        GROUP BY 1
        HAVING COUNT(tn.name) = 3 -- because we have 3 tags to find
    ),
    pics_with_sites (pic_id, sites) AS (
        SELECT pbt.pic_id, GROUP_CONCAT(s.name)
        FROM pics_by_tags pbt
        JOIN pics_sources ps ON pbt.pic_id = ps.pic_id
        JOIN sources s ON ps.source_id = s.id
        GROUP BY 1
    )
    SELECT p.*, pws.sites, GROUP_CONCAT(tn.name ORDER BY tn.name) AS tags
    FROM pics_with_sites pws
    JOIN pictures p ON pws.pic_id = p.id
    JOIN pics_tags pt ON pws.pic_id = pt.pic_id
    JOIN tags_new tn ON pt.tag_id = tn.id
    GROUP BY p.id;
    

    Here’s a db<>fiddle to play with.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search