MySQL find common start of string/orefix of column in standard SELECT ... GROUP BY query

Roemer
February 12, 2024
305 views
0 votes
2 Answers

A fun little MySQL puzzle!

Given the dataset

id	Description	Amount	Category
1	Food #1	1	1
2	Non-food #2 bla	3	2
3	Food banana	5	1
4	Non-food #2 foo	1	2
5	Non-food #2 foo bar	3	2

How do I get this easily grouped by category into:

Description	TotalAmount	Category
Food	6	1
Non-food #2	7	2

Thus a
SELECT COMMON_PREFIX(Description) AS Description, SUM(Amount) AS TotalAmount, Category FROM table GROUP BY Category

How to implement the fictious aggregate (GROUP BY) function COMMON_PREFIX?

I tried GROUP_CONCATenating the Description separated by a unique separator and with REGEXP find the common prefix, this would work were it not that MySQL doesn’t support back references in the search string (it works in other languages that do support them).

Any other ideas?

To be clear, I need the first words (no matter how many) that all GROUPED BY records have in common. The dataset is small, so GROUP_CONCAT, if needed, is an option.

Now of course, good database design would prevent a query like this from happening. But sometimes you have to add something in an existing application and it isn’t worth redesigning your database for just that one little query that is sparsely used. Adding fields to tables that are nearly never used clutters your database and make matters more confusing as well…

Tags: group-by mysql

Answers

- Max237mG
- February 12, 2024 at 5:01 pm
- 0 votes
0
Let me give you quick solution. It’s not the best because it assumes you know how many words can be at most in the common prefix. In this example I used 6 words MAX, you can easily increase it as you need.

Please let me know if that’s not your case and you need really generic solution for ANY number of words – I can do it as well, but it will be harder.

The idea is the following: for each group (grouped by category) I get MIN(description) and MAX(description). If they’re equal – great, we know it’s the same within the category, if they’re different, then try split them by words and check if first 5 words are the same, if not – then if first 4 words are the same etc.

Here is the working SQL Query:
```
SELECT
  CASE 
    WHEN d_min = d_max THEN d_min
    WHEN d_min5 = d_max5 THEN d_min5
    WHEN d_min4 = d_max4 THEN d_min4
    WHEN d_min3 = d_max3 THEN d_min3
    WHEN d_min2 = d_max2 THEN d_min2
    WHEN d_min1 = d_max1 THEN d_min1  
    ELSE ''
  END description,
  category,
  total_amount
FROM (
  SELECT
    d_min,
    d_max,
    SUBSTRING_INDEX(d_min, ' ', 1) d_min1,
    SUBSTRING_INDEX(d_max, ' ', 1) d_max1,
    SUBSTRING_INDEX(d_min, ' ', 2) d_min2,
    SUBSTRING_INDEX(d_max, ' ', 2) d_max2,
    SUBSTRING_INDEX(d_min, ' ', 3) d_min3,
    SUBSTRING_INDEX(d_max, ' ', 3) d_max3,
    SUBSTRING_INDEX(d_min, ' ', 4) d_min4,
    SUBSTRING_INDEX(d_max, ' ', 4) d_max4,
    SUBSTRING_INDEX(d_min, ' ', 5) d_min5,
    SUBSTRING_INDEX(d_max, ' ', 5) d_max5,
    category,
    total_amount
  FROM (
    SELECT
      MIN(description) d_min,
      MAX(description) d_max,
      category,
      SUM(amount) total_amount
    FROM my_table
    GROUP BY category
  ) t1
) t2
```
Login or Signup to reply.

Not a solution.

The below query provides all possible prefixes which are found in more than one row.

CREATE TABLE tab (
  `id` INTEGER,
  `Description` VARCHAR(100),
  `Amount` INTEGER,
  `Category` INTEGER
);

INSERT INTO tab
  (`id`, `Description`, `Amount`, `Category`)
VALUES
  ('1', 'Food #1', '1', '1'),
  ('2', 'Non-food bla', '3', '2'),
  ('3', 'Food banana', '5', '1'),
  ('4', 'Non-food foo', '1', '2'),
  ('5', 'Multiword prefix #1', '3', '2'),
  ('6', 'Multiword prefix #2', '1', '1'),
  ('7', 'Lonely prefix', '1', '1'),
  ('8', 'Multiword', '1', '1');

SELECT * FROM tab;

Records: 8  Duplicates: 0  Warnings: 0

id	Description	Amount	Category
1	Food #1	1	1
2	Non-food bla	3	2
3	Food banana	5	1
4	Non-food foo	1	2
5	Multiword prefix #1	3	2
6	Multiword prefix #2	1	1
7	Lonely prefix	1	1
8	Multiword	1	1

WITH RECURSIVE 
cte AS (
  SELECT id, description, 1 amount, SUBSTRING_INDEX(description, ' ', 1) prefix
  FROM tab
  UNION ALL
  SELECT t1.id, t1.description, 
         t1.amount + 1, SUBSTRING_INDEX(t1.description, ' ', t1.amount + 1)
  FROM cte t1
  JOIN tab t2 ON t1.id <> t2.id
             AND SUBSTRING_INDEX(t1.description, ' ', t1.amount + 1) = SUBSTRING_INDEX(t2.description, ' ', t1.amount + 1)
)
SELECT DISTINCT 
       tab.id, tab.description, cte.prefix
FROM tab
JOIN cte ON LOCATE(cte.prefix, tab.description) = 1
ORDER BY id, LENGTH(prefix)

id	description	prefix
1	Food #1	Food
2	Non-food bla	Non-food
3	Food banana	Food
4	Non-food foo	Non-food
5	Multiword prefix #1	Multiword
5	Multiword prefix #1	Multiword prefix
6	Multiword prefix #2	Multiword
6	Multiword prefix #2	Multiword prefix
7	Lonely prefix	Lonely
8	Multiword	Multiword

fiddle

But the fiddle demonstrates the problem. Source row 8 is a member of a group with common prefix "Multiword" which contains 3 rows totally. 2 another rows have longer common prefix "Multiword prefix". But if we use this longer prefix for these 2 rows then the row 8 will stay alone in its shorter prefix group… correct solution needs the longer prefix to be ignored, and a group which uses more short prefix group of 3 members to be returned. But this solution may be too complex.. and I don’t know does this is reasonable.

And another problem. Source row 7 have uniquely prefixed description. My query returns only 1st word of it.

Please signup or login to give your own answer.

Click here to cancel reply.

MySQL find common start of string/orefix of column in standard SELECT … GROUP BY query

Answers

Not a solution.