skip to Main Content

I’m working on MySQL 8.0 and I wanted to extract characters from my data.

There are variations within the same product such as flavors, prices including shipping or not, size, etc.

The data includes the following products.

Product Total
Organic Rice Snacks – Apple 6
Organic Rice Snacks – Rice 5
62-Sam Almond Crisp Jujube – 20g 4
62-Sam Almond Crisp Jujube – 20g (members-only) 5
[Sale] Extra Virgin Sprouted Perilla Oil Best Before 9/3 or 9/11/2021 (160ml) 3
[Sale] Extra Virgin Sprouted Perilla Oil Best Before 9/3 or 9/11/2021 (300ml) 3.5
[Steak Family Set 4] Ribeye 12 oz. x 4 (Never frozen) 20
[Steak Family Set 4] Ribeye 12 oz. x 4 (Shipping Included) 21
Ecomom Sangol Organic Rice Snacks (Pack of Three) 6
Ecomom Sangol Organic Rice Snacks – Apple 7.5

And I would like to find the following result by grouping the variants of the products:

NewProduct GTotal
Organic Rice Snacks 11
62-Sam Almond Crisp Jujube 9
[Sale] Extra Virgin Sprouted Perilla Oil Best Before 9/3 or 9/11/2021 6.5
[Steak Family Set 4] Ribeye 41
Ecomom Sangol Organic Rice Snacks 13.5

I tried the following but it doesn’t apply to the whole data.

Select  rtrim(substring_index(substring_index(substring_index(Product, '-', 1), '(', 1), '[', 1)) as NewProduct, SUM(Total) as GTotal
From myData
Group By NewProduct

Could anyone help me solve this problem? I’d really appreciate it.

P.S. I asked a similar question on another thread but I realized my data is much more complicated.

2

Answers


  1. as Shadow pointed out, there is no common logic between these product names that you could use to perform the grouping to your requirement. You’ll never get it with pure SQL. But luckily we are here to help!

    You can’t group it for sure by removing some unallowed character and characters between them, but you can recursively check for occurrences between your product names, gradually including a greater part of the string. That needs some code to be achieved, here you have a python script that do the job

    import sqlite3
    
    
    class Product:
    
        UNALLOWEDS = ["-", "(", ")", "[", "]"]
    
        def __init__(self, id, name, amount):
            self.id = id
            self.name = name
            self.amount = amount
            self.normalizations = 0
            self.substr1 = ""
            self.substr2 = ""
    
        def normalize(self, products):
            occurences = 0
            positions = list()
    
            for unallowed in Product.UNALLOWEDS:
                if self.name.find(unallowed) != -1:
                    positions.extend([i for i, letter in enumerate(self.name) if letter == unallowed])
    
            for _ in range(self.normalizations):
                positions.pop(positions.index(min(positions)))
    
            positions = [i for i in positions if i != -1]
            if not len(positions):
                return self.id, self.name, self.amount
    
            substr = self.name[0:min(positions)].strip()
            for product in products:
                if product.id != self.id:
                    if product.name.startswith(substr):
                        occurences += 1
    
            if occurences and len(positions) > 1:
                self.normalizations += 1
                self.substr2 = self.substr1
                self.substr1 = substr
                return self.normalize(products=products)
            elif not occurences and self.substr2 and not self.name[0] in Product.UNALLOWEDS:
                return self.id, self.substr2, self.amount
            elif not occurences and self.substr1:
                return self.id, self.substr1, self.amount
            else:
                return self.id, substr, self.amount
    
    
    if __name__ == "__main__":
        products = list()
        normalizeds = list()
    
        con = sqlite3.connect("prod.db")
        cur = con.cursor()
    
        for row in cur.execute("SELECT * FROM products"):
            products.append(Product(row[0], row[1], row[2]))
    
        for product in products:
            normalized_data = product.normalize(products=products)
            normalizeds.append(Product(normalized_data[0], normalized_data[1], normalized_data[2]))
    
        for normalied in normalizeds:
            print(normalied.id, normalied.name, normalied.amount)
    
        con.close()
    
    

    I created the following table called products under a sqlite3 db

    id name amount
    1 Organic Rice Snacks – Apple 6.0
    2 Organic Rice Snacks – Rice 5.0
    3 62-Sam Almond Crisp Jujube – 20g 4.0
    4 62-Sam Almond Crisp Jujube – 20g (members-only) 5.0
    5 [Sale] Extra Virgin Sprouted Perilla Oil Best Before 9/3 or 9/11/2021 (160ml) 3.0
    6 [Sale] Extra Virgin Sprouted Perilla Oil Best Before 9/3 or 9/11/2021 (300ml) 3.5
    7 [Steak Family Set 4] Ribeye 12 oz. x 4 (Never frozen) 20.0
    8 [Steak Family Set 4] Ribeye 12 oz. x 4 (Shipping Included) 21.0
    9 Ecomom Sangol Organic Rice Snacks (Pack of Three) 6.0
    10 Ecomom Sangol Organic Rice Snacks – Apple 7.5

    Running the above code on the table printed out the following result:

    1 Organic Rice Snacks 6.0  
    2 Organic Rice Snacks 5.0  
    3 62-Sam Almond Crisp Jujube 4.0  
    4 62-Sam Almond Crisp Jujube 5.0  
    5 [Sale] Extra Virgin Sprouted Perilla Oil Best Before 9/3 or 9/11/2021 3.0  
    6 [Sale] Extra Virgin Sprouted Perilla Oil Best Before 9/3 or 9/11/2021 3.5  
    7 [Steak Family Set 4] Ribeye 12 oz. x 4 20.0  
    8 [Steak Family Set 4] Ribeye 12 oz. x 4 21.0  
    9 Ecomom Sangol Organic Rice Snacks 6.0  
    10 Ecomom Sangol Organic Rice Snacks 7.5
    

    Of course, you’re using a mysql db, so adapt the connection part in a mysql supported way. Also, that’s not giving back your grouped table, but it’s creating a list of "Normalized" objects, please notice that it should be very easy to write those objects as records in a table and execute all your desired queries after.

    IMPORTANT NOTE
    While that could be a solution, consider that your situation is a pure mess. First things first you should consider to have a dedicated table to your products, a very basic product_id –> product_plain_name and the use those ids as foreign keys where needed. Modifiers, such as frozen, pack of three, members-only, exc should be applied apart, you can use another foreign key column with the same approach used above: modifier_id –> modifier_name. Integer column where every number matches a specific modifier in the code could be another valid solution. One last thing, my code does not scale, that could do the job with a few hundred records, but it can’t clearly handle milions. Search for db design and normalization for an in depth explanations.

    Hope it helps,
    Hele.

    Login or Signup to reply.
  2. SELECT TRIM(SUBSTRING_INDEX(REPLACE(product, '(', ' - '), ' - ', 1)) AS NewProduct,
           SUM(total) GTotal
    FROM test
    GROUP BY 1
    
    NewProduct GTotal
    Organic Rice Snacks 11.00
    62-Sam Almond Crisp Jujube 9.00
    [Sale] Extra Virgin Sprouted Perilla Oil Best Before 9/3 or 9/11/2021 6.50
    [Steak Family Set 4] Ribeye 12 oz. x 4 41.00
    Ecomom Sangol Organic Rice Snacks 13.50

    fiddle

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search