skip to Main Content

A fun little MySQL puzzle!

Given the dataset

id Description Amount Category
1 Food #1 1 1
2 Non-food #2 bla 3 2
3 Food banana 5 1
4 Non-food #2 foo 1 2
5 Non-food #2 foo bar 3 2

How do I get this easily grouped by category into:

Description TotalAmount Category
Food 6 1
Non-food #2 7 2

Thus a
SELECT COMMON_PREFIX(Description) AS Description, SUM(Amount) AS TotalAmount, Category FROM table GROUP BY Category

How to implement the fictious aggregate (GROUP BY) function COMMON_PREFIX?

I tried GROUP_CONCATenating the Description separated by a unique separator and with REGEXP find the common prefix, this would work were it not that MySQL doesn’t support back references in the search string (it works in other languages that do support them).

Any other ideas?

To be clear, I need the first words (no matter how many) that all GROUPED BY records have in common. The dataset is small, so GROUP_CONCAT, if needed, is an option.

Now of course, good database design would prevent a query like this from happening. But sometimes you have to add something in an existing application and it isn’t worth redesigning your database for just that one little query that is sparsely used. Adding fields to tables that are nearly never used clutters your database and make matters more confusing as well…

2

Answers


  1. Let me give you quick solution. It’s not the best because it assumes you know how many words can be at most in the common prefix. In this example I used 6 words MAX, you can easily increase it as you need.

    Please let me know if that’s not your case and you need really generic solution for ANY number of words – I can do it as well, but it will be harder.

    The idea is the following: for each group (grouped by category) I get MIN(description) and MAX(description). If they’re equal – great, we know it’s the same within the category, if they’re different, then try split them by words and check if first 5 words are the same, if not – then if first 4 words are the same etc.

    Here is the working SQL Query:

    SELECT
      CASE 
        WHEN d_min = d_max THEN d_min
        WHEN d_min5 = d_max5 THEN d_min5
        WHEN d_min4 = d_max4 THEN d_min4
        WHEN d_min3 = d_max3 THEN d_min3
        WHEN d_min2 = d_max2 THEN d_min2
        WHEN d_min1 = d_max1 THEN d_min1  
        ELSE ''
      END description,
      category,
      total_amount
    FROM (
      SELECT
        d_min,
        d_max,
        SUBSTRING_INDEX(d_min, ' ', 1) d_min1,
        SUBSTRING_INDEX(d_max, ' ', 1) d_max1,
        SUBSTRING_INDEX(d_min, ' ', 2) d_min2,
        SUBSTRING_INDEX(d_max, ' ', 2) d_max2,
        SUBSTRING_INDEX(d_min, ' ', 3) d_min3,
        SUBSTRING_INDEX(d_max, ' ', 3) d_max3,
        SUBSTRING_INDEX(d_min, ' ', 4) d_min4,
        SUBSTRING_INDEX(d_max, ' ', 4) d_max4,
        SUBSTRING_INDEX(d_min, ' ', 5) d_min5,
        SUBSTRING_INDEX(d_max, ' ', 5) d_max5,
        category,
        total_amount
      FROM (
        SELECT
          MIN(description) d_min,
          MAX(description) d_max,
          category,
          SUM(amount) total_amount
        FROM my_table
        GROUP BY category
      ) t1
    ) t2
    

    SQL Query Result

    Login or Signup to reply.
  2. Not a solution.

    The below query provides all possible prefixes which are found in more than one row.

    CREATE TABLE tab (
      `id` INTEGER,
      `Description` VARCHAR(100),
      `Amount` INTEGER,
      `Category` INTEGER
    );
    
    INSERT INTO tab
      (`id`, `Description`, `Amount`, `Category`)
    VALUES
      ('1', 'Food #1', '1', '1'),
      ('2', 'Non-food bla', '3', '2'),
      ('3', 'Food banana', '5', '1'),
      ('4', 'Non-food foo', '1', '2'),
      ('5', 'Multiword prefix #1', '3', '2'),
      ('6', 'Multiword prefix #2', '1', '1'),
      ('7', 'Lonely prefix', '1', '1'),
      ('8', 'Multiword', '1', '1');
    
    SELECT * FROM tab;
    
    Records: 8  Duplicates: 0  Warnings: 0
    
    id Description Amount Category
    1 Food #1 1 1
    2 Non-food bla 3 2
    3 Food banana 5 1
    4 Non-food foo 1 2
    5 Multiword prefix #1 3 2
    6 Multiword prefix #2 1 1
    7 Lonely prefix 1 1
    8 Multiword 1 1
    WITH RECURSIVE 
    cte AS (
      SELECT id, description, 1 amount, SUBSTRING_INDEX(description, ' ', 1) prefix
      FROM tab
      UNION ALL
      SELECT t1.id, t1.description, 
             t1.amount + 1, SUBSTRING_INDEX(t1.description, ' ', t1.amount + 1)
      FROM cte t1
      JOIN tab t2 ON t1.id <> t2.id
                 AND SUBSTRING_INDEX(t1.description, ' ', t1.amount + 1) = SUBSTRING_INDEX(t2.description, ' ', t1.amount + 1)
    )
    SELECT DISTINCT 
           tab.id, tab.description, cte.prefix
    FROM tab
    JOIN cte ON LOCATE(cte.prefix, tab.description) = 1
    ORDER BY id, LENGTH(prefix)
    
    id description prefix
    1 Food #1 Food
    2 Non-food bla Non-food
    3 Food banana Food
    4 Non-food foo Non-food
    5 Multiword prefix #1 Multiword
    5 Multiword prefix #1 Multiword prefix
    6 Multiword prefix #2 Multiword
    6 Multiword prefix #2 Multiword prefix
    7 Lonely prefix Lonely
    8 Multiword Multiword

    fiddle

    But the fiddle demonstrates the problem. Source row 8 is a member of a group with common prefix "Multiword" which contains 3 rows totally. 2 another rows have longer common prefix "Multiword prefix". But if we use this longer prefix for these 2 rows then the row 8 will stay alone in its shorter prefix group… correct solution needs the longer prefix to be ignored, and a group which uses more short prefix group of 3 members to be returned. But this solution may be too complex.. and I don’t know does this is reasonable.

    And another problem. Source row 7 have uniquely prefixed description. My query returns only 1st word of it.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search