
The regular expression I am using is ^\s*(\d+)\s*(([A-Za-z]+\s*)+)?(\d+)\s+(.+?)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)$

When the following sample data string is parsed and categorized:

" 1 NA BEVERAGE 1100 ICED TEA 14.00 3.00 42.00 3.50 0.00 42.00 0.00 0.52 47.09"

the output is incorrect. Looking at the categorized data before converting it to JSON, I see 'item_category': 'NA BEVERAGE ', 'item_number': 'BEVERAGE '. It should be 'item_category': 'NA BEVERAGE ', 'item_number': '1100', and so on.

What I actually get:

{'item_rank': '1', 'item_category': 'NA BEVERAGE ', 'item_number': 'BEVERAGE ', 'item_name': '1100', 'number_sold': 'ICED TEA', 'price_sold': '14.00', 'amount': '3.00', 'tax': '42.00', 'cost': '3.50', 'profit': '0.00', 'food_cost': '42.00', 'precent_sales': '0.00', 'cat_sales': '0.52'}

I tried fixing the regular expression multiple times to no avail. An explanation of what’s wrong is appreciated.

Here is the logic of the Python script, which you can copy and run on your own machine:

import re
import json

page_text_str = "   1 NA BEVERAGE 1100 ICED TEA 14.00 3.00 42.00 3.50 0.00 42.00 0.00 0.52 47.09"

sale_line_re = re.compile(r'^\s*(\d+)\s*(([A-Za-z]+\s*)+)?(\d+)\s+(.+?)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)$')
grouped_data = []

for line in page_text_str.split('\n'):
    print(line)
    match = sale_line_re.match(line)
    if match:
        groups = match.groups()
        item = {
            "item_rank": groups[0],
            "item_category": groups[1],
            "item_number": groups[2],
            "item_name": groups[3],
            "number_sold": groups[4],
            "price_sold": groups[5],
            "amount": groups[6],
            "tax": groups[7],
            "cost": groups[8],
            "profit": groups[9],
            "food_cost": groups[10],
            "precent_sales": groups[11],
            "cat_sales": groups[12]
        }
        grouped_data.append(item)


for sale in grouped_data:
    print(sale)

2 Answers


  1. Instead of building a regex that describes every number, it would be easier to use re.split to split on whitespace that is adjacent to a digit, which leaves the spaces between words intact. re.split returns a list, which you can then iterate over to build the JSON.

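    (?<=\d)\s|\s(?=\d)

    • (?<=\d), lookbehind: the position must be immediately preceded by a digit
    • (?=\d), lookahead: the position must be immediately followed by a digit
    • \s|\s: each alternative matches one whitespace character, so the pattern splits on a whitespace character that comes right after or right before a digit.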

    regex101.com
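A minimal sketch of the re.split approach described above, applied to the sample line from the question. The FIELDS list and the split_sale_line helper are hypothetical names introduced for illustration; the field names are copied from the question's dictionary. It assumes fields are separated by single spaces and that category and item names contain no digits, otherwise the split would produce extra or empty pieces.

```python
import json
import re

# Field names taken from the dictionary in the question (spelling kept as-is).
FIELDS = [
    "item_rank", "item_category", "item_number", "item_name",
    "number_sold", "price_sold", "amount", "tax", "cost",
    "profit", "food_cost", "precent_sales", "cat_sales",
]

# Split on one whitespace character that follows or precedes a digit.
SPLIT_RE = re.compile(r"(?<=\d)\s|\s(?=\d)")


def split_sale_line(line):
    """Return a dict for one sale line, or None if the field count is off."""
    # Strip first so leading spaces do not produce a stray first element.
    values = SPLIT_RE.split(line.strip())
    if len(values) != len(FIELDS):
        return None
    return dict(zip(FIELDS, values))


line = "   1 NA BEVERAGE 1100 ICED TEA 14.00 3.00 42.00 3.50 0.00 42.00 0.00 0.52 47.09"
item = split_sale_line(line)
print(json.dumps(item, indent=2))
```

Checking the number of pieces against the number of field names is a cheap guard: a line that does not fit the expected layout is skipped instead of being mapped to the wrong keys.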

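  2. The issue is that you are repeating a capture group. A repeated capture group keeps only the value of its last iteration, and because it is a numbered group of its own, 'BEVERAGE ' shows up as an extra group that shifts every field after it.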
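The behaviour can be seen in isolation with a short sketch. The variable names here are illustrative only; the two sub-patterns are the category part of the question's regex and its non-capturing counterpart.

```python
import re

text = "NA BEVERAGE "

# Repeated capturing group: the inner group is overwritten on every
# iteration, so it ends up holding only the last word.
capturing = re.match(r"(([A-Za-z]+\s*)+)", text)
print(capturing.groups())      # ('NA BEVERAGE ', 'BEVERAGE ')

# Repeated non-capturing group: only the outer group is numbered.
non_capturing = re.match(r"((?:[A-Za-z]+\s+)+)", text)
print(non_capturing.groups())  # ('NA BEVERAGE ',)
```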

    You can change (([A-Za-z]+\s*)+)? -> ((?:[A-Za-z]+\s+)+)?

    This change will:

    • repeat a non-capturing group, so the whole category value is now in group 2 and the item number moves to group 3

    • require one or more whitespace characters (\s+ instead of \s*) inside the repeated non-capturing group

      ^\s*(\d+)\s*((?:[A-Za-z]+\s+)+)?(\d+)\s+(.+?)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)$
      

    See the updated pattern in this regex demo and the Python code
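As a rough sketch, the updated pattern can be dropped into the question's script as follows. The FIELDS list and the way the pattern is assembled (repeating the decimal sub-pattern nine times) are choices made here for readability; the resulting regex is the same as the one above. Stripping the captured values is also an addition, since the category group otherwise keeps its trailing space.

```python
import re

# Field names taken from the dictionary in the question (spelling kept as-is).
FIELDS = [
    "item_rank", "item_category", "item_number", "item_name",
    "number_sold", "price_sold", "amount", "tax", "cost",
    "profit", "food_cost", "precent_sales", "cat_sales",
]

# Same pattern as above: the category uses a repeated non-capturing group,
# followed by nine decimal columns.
sale_line_re = re.compile(
    r"^\s*(\d+)\s*((?:[A-Za-z]+\s+)+)?(\d+)\s+(.+?)"
    + r"\s+(\d+\.\d+)" * 9
    + r"$"
)

page_text_str = "   1 NA BEVERAGE 1100 ICED TEA 14.00 3.00 42.00 3.50 0.00 42.00 0.00 0.52 47.09"

grouped_data = []
for line in page_text_str.split("\n"):
    match = sale_line_re.match(line)
    if match:
        # The optional category group is None when absent, so guard the strip.
        values = [g.strip() if g is not None else g for g in match.groups()]
        grouped_data.append(dict(zip(FIELDS, values)))

for sale in grouped_data:
    print(sale)
```

With 13 capture groups and 13 field names, each group now lines up with the key it was meant for.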
