
I am trying to write a program that can take any recipe made in WPRM (WordPress Recipe Maker) and automatically rescale it. It is supposed to take the HTML code of the print version of such a recipe (see for example this cookie recipe) and extract the relevant information (amount, unit, name) for each ingredient using regular expressions. The relevant parts are enclosed by something along the lines of …ingredient-attribute"> and </span>. However, the closing delimiter is sometimes not recognized.

This is the relevant part of the code:

ingredients = re.split(r'<li', recipestr_cut)

for x in ingredients:
    out = []  # one ingredient, collected as [amount, unit, name]

    amount_m = re.search(r'ingredient-amount">(.+)</span>', x)
    if amount_m:
        out.append(amount_m.group(1))
    unit_m = re.search(r'ingredient-unit">(.+)</span>', x)
    if unit_m:
        out.append(unit_m.group(1))
    ingredient_m = re.search(r'ingredient-name">(.+)</span>', x)
    if ingredient_m:
        out.append(ingredient_m.group(1))

    if len(out) > 0:
        recipe_readable.append(out)

For the recipe in the example before, this works neatly and returns

[['3', 'cups', '(380 grams) all-purpose flour'], ['1', 'teaspoon', 'baking soda'], ['1', 'teaspoon', 'fine sea salt'], ['2', 'sticks (227 grams) unsalted butter, at cool room temperature (67°F)'], ['1/2', 'cup', '(100 grams) granulated sugar'], ['1 1/4', 'cups', '(247 grams) lightly packed light brown sugar'], ['2', 'teaspoons', 'vanilla'], ['2', 'large eggs, at room temperature'], ['2', 'cups', '(340 grams) semisweet chocolate chips']]

However, if one instead uses a somewhat more complicated recipe, for example this pork bun recipe, it apparently no longer recognizes the </span> delimiter, because the output looks like this:

[['2/3</span>&#32;<span class="wprm-recipe-ingredient-unit">cup</span>&#32;<span class="wprm-recipe-ingredient-name">heavy cream', 'cup</span>&#32;<span class="wprm-recipe-ingredient-name">heavy cream</span>&#32;<span class="wprm-recipe-ingredient-notes wprm-recipe-ingredient-notes-faded">(at room temperature)', ...

This clearly includes the </span> and also does not split the string in the right places. Instead, I would expect something like

[['2/3','cup','heavy cream'],...

How could this be fixed? I have only very recently learned about regular expressions, so this is still scary to me.

2 Answers


  1. Problem

    When several spans sit on one line of the string (which is legal HTML), this regex becomes ambiguous:

    ingredient-amount">(.+)</span>
    

    It could match this:

    ingredient-amount">2/3</span>
    

    but it could also match this:

    ingredient-amount">2/3</span>&#32;<span class="wprm-recipe-ingredient-unit">cup</span>
    

    The regex engine resolves this ambiguity by choosing the longest match, because + consumes as many characters as it can. This is called greedy matching.

    Solution #1

    You want it to match as few characters as possible instead. This is called a lazy match, and you get it by putting a ? after the quantifier. Example:

    ingredient-amount">(.+?)</span>
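    The difference between the two patterns can be checked on a single line. This is a minimal sketch: the sample string is pieced together from the output in the question, and the class name of the amount span (wprm-recipe-ingredient-amount) is an assumption based on the other class names shown there.

```python
import re

# One line of HTML, modelled on the output shown in the question.
# The opening amount span is an assumption; the rest appears in that output.
line = ('<span class="wprm-recipe-ingredient-amount">2/3</span>&#32;'
        '<span class="wprm-recipe-ingredient-unit">cup</span>&#32;'
        '<span class="wprm-recipe-ingredient-name">heavy cream</span>')

# Greedy: (.+) grabs as much as it can, so it stops at the LAST </span>.
greedy = re.search(r'ingredient-amount">(.+)</span>', line).group(1)

# Lazy: (.+?) grabs as little as it can, so it stops at the FIRST </span>.
lazy = re.search(r'ingredient-amount">(.+?)</span>', line).group(1)

print(greedy)  # runs on through the unit and name spans
print(lazy)    # 2/3
```

    The first recipe most likely worked only because each span sat on its own line: by default . does not match a newline, so even the greedy version could not run past the end of the line.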
    

    Solution #2

    The previous way will work, but my preferred solution is to make it so that the "match anything" part of the regex won’t skip over any span tags. That would look like this:

    ingredient-amount">([^<]+)</span>
    

    This changes the "match anything" part to "match anything but <", so the match can never run past the start of the next tag.

    I like this kind of solution because it’s easier to debug.
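    Applied to the loop from the question, Solution #2 might look like the sketch below. The function name parse_ingredients and the sample HTML are made up for illustration; the sample only imitates the structure visible in the question's output, not the real page.

```python
import re

def parse_ingredients(html):
    """Return one [amount, unit, name] list per <li>, skipping missing parts."""
    recipe = []
    for chunk in re.split(r'<li', html):
        out = []
        for attr in ('amount', 'unit', 'name'):
            # [^<]+ cannot cross a tag boundary, so the match ends
            # at the first </span> after the opening tag.
            m = re.search(rf'ingredient-{attr}">([^<]+)</span>', chunk)
            if m:
                out.append(m.group(1))
        if out:
            recipe.append(out)
    return recipe

# Hypothetical sample with everything on one line, as in the pork bun recipe.
sample = (
    '<ul>'
    '<li class="wprm-recipe-ingredient">'
    '<span class="wprm-recipe-ingredient-amount">2/3</span>&#32;'
    '<span class="wprm-recipe-ingredient-unit">cup</span>&#32;'
    '<span class="wprm-recipe-ingredient-name">heavy cream</span>&#32;'
    '<span class="wprm-recipe-ingredient-notes wprm-recipe-ingredient-notes-faded">'
    '(at room temperature)</span>'
    '</li>'
    '<li class="wprm-recipe-ingredient">'
    '<span class="wprm-recipe-ingredient-amount">1</span>&#32;'
    '<span class="wprm-recipe-ingredient-unit">cup</span>&#32;'
    '<span class="wprm-recipe-ingredient-name">milk</span>'
    '</li></ul>'
)

print(parse_ingredients(sample))
# [['2/3', 'cup', 'heavy cream'], ['1', 'cup', 'milk']]
```

    One caveat: if an ingredient name contains a nested tag (a link, for instance), [^<]+ will not match that span at all, whereas the lazy version would return the inner markup.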

  2. Instead of regex, use an HTML parser, like BeautifulSoup:

    from bs4 import BeautifulSoup
    
    def get_ingredients_info_from_html(html):
      soup = BeautifulSoup(html, 'html.parser')
    
      ingredients = []
      class_name = 'wprm-recipe-ingredient'
    
      for ingredient in soup.select(f'.{class_name}'):
        record = dict.fromkeys(['amount', 'unit', 'name', 'notes'])
    
        for info in record:
          element = ingredient.select_one(f'.{class_name}-{info}')
    
          if element:
            record[info] = element.text
        
        ingredients.append(record)
    
      return ingredients
    

    Try it:

    import requests
    
    url = 'https://thewoksoflife.com/wprm_print/31141'
    
    print(get_ingredients_info_from_html(requests.get(url).content))
    
    '''
    [
        {
            'amount': '2/3',
            'unit': 'cup',
            'name': 'heavy cream',
            'notes': '(at room temperature)'
        },
        {
            'amount': '1',
            'unit': 'cup',
            'name': 'milk',
            'notes': '(whole milk preferred, but you can use 2%, at room temperature)'
        },
        ...
    ]
    '''
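    Since the goal in the question is rescaling, the records returned above could then be scaled along these lines. This is only a sketch: parse_amount and rescale are hypothetical helpers, and it assumes amounts are written as plain numbers, fractions, or mixed numbers such as '1 1/4'. Ranges like '1-2' or characters like '½' would raise a ValueError and need extra handling.

```python
from fractions import Fraction

def parse_amount(text):
    """Convert strings such as '3', '2/3', '1 1/4' or '0.5' to an exact number."""
    # '1 1/4' splits into '1' and '1/4', which are summed.
    return sum(Fraction(part) for part in text.split())

def rescale(records, factor):
    """Return copies of the records with every 'amount' multiplied by factor."""
    scaled = []
    for record in records:
        new = dict(record)
        if record.get('amount'):  # amount may be None for some ingredients
            new['amount'] = str(parse_amount(record['amount']) * Fraction(factor))
        scaled.append(new)
    return scaled

records = [
    {'amount': '2/3', 'unit': 'cup', 'name': 'heavy cream',
     'notes': '(at room temperature)'},
    {'amount': None, 'unit': None, 'name': 'salt', 'notes': None},  # made-up entry
]

print(rescale(records, 2)[0]['amount'])  # 4/3
```

    Fraction keeps the arithmetic exact, so doubling 2/3 gives 4/3 rather than 1.3333333333333333.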
    