A fun little MySQL puzzle!
Given the dataset
id | Description | Amount | Category |
---|---|---|---|
1 | Food #1 | 1 | 1 |
2 | Non-food #2 bla | 3 | 2 |
3 | Food banana | 5 | 1 |
4 | Non-food #2 foo | 1 | 2 |
5 | Non-food #2 foo bar | 3 | 2 |
How do I get this easily grouped by category into:
Description | TotalAmount | Category |
---|---|---|
Food | 6 | 1 |
Non-food #2 | 7 | 2 |
Thus a
SELECT COMMON_PREFIX(Description) AS Description, SUM(Amount) AS TotalAmount, Category FROM table GROUP BY Category
How to implement the fictious aggregate (GROUP BY) function COMMON_PREFIX
?
I tried GROUP_CONCATenating the Description separated by a unique separator and with REGEXP find the common prefix, this would work were it not that MySQL doesn’t support back references in the search string (it works in other languages that do support them).
Any other ideas?
To be clear, I need the first words (no matter how many) that all GROUPED BY records have in common. The dataset is small, so GROUP_CONCAT, if needed, is an option.
Now of course, good database design would prevent a query like this from happening. But sometimes you have to add something in an existing application and it isn’t worth redesigning your database for just that one little query that is sparsely used. Adding fields to tables that are nearly never used clutters your database and make matters more confusing as well…
2
Answers
Let me give you quick solution. It’s not the best because it assumes you know how many words can be at most in the common prefix. In this example I used 6 words MAX, you can easily increase it as you need.
Please let me know if that’s not your case and you need really generic solution for ANY number of words – I can do it as well, but it will be harder.
The idea is the following: for each group (grouped by category) I get MIN(description) and MAX(description). If they’re equal – great, we know it’s the same within the category, if they’re different, then try split them by words and check if first 5 words are the same, if not – then if first 4 words are the same etc.
Here is the working SQL Query:
Not a solution.
The below query provides all possible prefixes which are found in more than one row.
fiddle
But the fiddle demonstrates the problem. Source row 8 is a member of a group with common prefix "Multiword" which contains 3 rows totally. 2 another rows have longer common prefix "Multiword prefix". But if we use this longer prefix for these 2 rows then the row 8 will stay alone in its shorter prefix group… correct solution needs the longer prefix to be ignored, and a group which uses more short prefix group of 3 members to be returned. But this solution may be too complex.. and I don’t know does this is reasonable.
And another problem. Source row 7 have uniquely prefixed description. My query returns only 1st word of it.