
I have been getting mixed answers about whether or not to use regex here.

I am trying to grab a specific value (the JSON assigned to spConfig) from this HTML:

<script type="text/x-magento-init">
        {
            "#product_addtocart_form": {
                "configurable": {
                    "spConfig": {"attributes":{"93":{"id":"93","code":"color","label":"Color","options":[{"id":"8243","label":"Helloworld","products":["97460","97459"]}],"position":"0"},"148":{"id":"148","code":"codish","label":"Codish","options":[{"id":"4707","label":"12.5","products":[]},{"id":"2724","label":"13","products":[]},{"id":"4708","label":"13.5","products":[]}],"position":"1"}},"template":"EUR <%- data.price %>","optionPrices":{"97459":{"oldPrice":{"amount":121},"basePrice":{"amount":121},"finalPrice":{"amount":121},"tierPrices":[]}},"prices":{"oldPrice":{"amount":"121"},"basePrice":{"amount":"121"},"finalPrice":{"amount":"121"}},"productId":"97468","chooseText":"Choose an Option...","images":[],"index":[]},
                    "gallerySwitchStrategy": "replace"
                }
            }
        }
    </script>

Here is the problem: the scraped HTML contains multiple <script type="text/x-magento-init"> elements, but only one of them contains spConfig. I have two questions:

  1. Should I grab the spConfig value with a regex and then call json.loads(spConfigValue) on it? If not, what method should I use to extract the JSON value?

  2. If regex is the way to go: I have been trying "spConfig": (.*?) but it does not capture the JSON value. What am I doing wrong?

3 Answers


  1. No, don't ever use regex to parse HTML. Use an HTML parser such as BeautifulSoup instead!
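On the second question: in "spConfig": (.*?) the group is lazy and nothing follows it in the pattern, so it is satisfied by matching zero characters and the capture comes back empty. If a quick shortcut is still wanted, one possible sketch is to find the key with a regex and let json.JSONDecoder.raw_decode read exactly one JSON value from that position. The shortened `page` string and the `extract_sp_config` helper are made up for illustration, and this assumes the key occurs once and is followed by valid JSON.

```python
import json
import re

# Hypothetical, shortened page source standing in for the real HTML.
page = '''<script type="text/x-magento-init">
{"#product_addtocart_form": {"configurable": {
    "spConfig": {"attributes": {"93": {"code": "color"}}, "productId": "97468"},
    "gallerySwitchStrategy": "replace"}}}
</script>'''

# The pattern from the question: the lazy group has nothing after it,
# so the shortest possible match is the empty string.
lazy = re.search(r'"spConfig": (.*?)', page)
empty_group = lazy.group(1)


def extract_sp_config(text):
    """Return the JSON value that follows the "spConfig" key, or None."""
    key = re.search(r'"spConfig"\s*:\s*', text)
    if key is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever comes after it.
    value, _end = json.JSONDecoder().raw_decode(text, key.end())
    return value


sp_config = extract_sp_config(page)
print(repr(empty_group))
print(sp_config["productId"])
```

This sidesteps the need to write a regex that balances nested braces, which regular expressions handle poorly; the JSON decoder does the balancing instead.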

  2. Use the parser that matches the format: a JSON parser for JSON, a YAML parser for YAML, and an HTML parser for HTML.
    Here is an example using Python's built-in html.parser module:

    from html.parser import HTMLParser

    class MyHTMLParser(HTMLParser):
        # Called for every opening tag
        def handle_starttag(self, tag, attrs):
            print("Encountered a start tag:", tag)

        # Called for every closing tag
        def handle_endtag(self, tag):
            print("Encountered an end tag :", tag)

        # Called for the text between tags
        def handle_data(self, data):
            print("Encountered some data  :", data)

    parser = MyHTMLParser()
    parser.feed('<html><head><title>Test</title></head>'
                '<body><h1>Parse me!</h1></body></html>')
    

    https://docs.python.org/3/library/html.parser.html
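Applied to the question, the same module can collect the body of each text/x-magento-init script and hand it to json.loads. This is only a rough sketch using the standard library: the class name `MagentoInitParser`, the helper `find_sp_config`, and the shortened sample page are invented for illustration, and it assumes spConfig always sits under a "configurable" key as in the question.

```python
import json
from html.parser import HTMLParser


class MagentoInitParser(HTMLParser):
    """Collect the text of every <script type="text/x-magento-init"> block."""

    def __init__(self):
        super().__init__()
        self._in_target = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "text/x-magento-init":
            self._in_target = True
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several pieces, so buffer it.
        if self._in_target:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_target:
            self.blocks.append("".join(self._chunks))
            self._in_target = False


def find_sp_config(html):
    """Return the first spConfig found in any magento-init block, or None."""
    parser = MagentoInitParser()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if not isinstance(data, dict):
            continue
        for widgets in data.values():
            if not isinstance(widgets, dict):
                continue
            configurable = widgets.get("configurable")
            if isinstance(configurable, dict) and "spConfig" in configurable:
                return configurable["spConfig"]
    return None


# Hypothetical page with two init blocks; only the second has spConfig.
sample = '''<html><head>
<script type="text/x-magento-init">{"*": {"mage/cookies": {}}}</script>
<script type="text/x-magento-init">
{"#product_addtocart_form": {"configurable": {
    "spConfig": {"productId": "97468", "images": []},
    "gallerySwitchStrategy": "replace"}}}
</script>
</head><body></body></html>'''

sp_config = find_sp_config(sample)
print(sp_config["productId"])
```

Because every block is parsed as JSON and then inspected, it does not matter how many other text/x-magento-init scripts are on the page.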

  3. In this case, with bs4 4.7.1+, :contains is your friend (newer Soup Sieve releases deprecate this spelling in favor of :-soup-contains). You say there is only a single match, so you can do the following:

    from bs4 import BeautifulSoup as bs
    import json
    
    html = '''<html>
     <head>
      <script type="text/x-magento-init">
            {
                "#product_addtocart_form": {
                    "configurable": {
                        "spConfig": {"attributes":{"93":{"id":"93","code":"color","label":"Color","options":[{"id":"8243","label":"Helloworld","products":["97460","97459"]}],"position":"0"},"148":{"id":"148","code":"codish","label":"Codish","options":[{"id":"4707","label":"12.5","products":[]},{"id":"2724","label":"13","products":[]},{"id":"4708","label":"13.5","products":[]}],"position":"1"}},"template":"EUR <%- data.price %>","optionPrices":{"97459":{"oldPrice":{"amount":121},"basePrice":{"amount":121},"finalPrice":{"amount":121},"tierPrices":[]}},"prices":{"oldPrice":{"amount":"121"},"basePrice":{"amount":"121"},"finalPrice":{"amount":"121"}},"productId":"97468","chooseText":"Choose an Option...","images":[],"index":[]},
                        "gallerySwitchStrategy": "replace"
                    }
                }
            }
        </script>
     </head>
     <body></body>
    </html>'''
    soup = bs(html, 'html.parser')
    # Select the one <script> whose text contains "spConfig" and parse its body as JSON
    data = json.loads(soup.select_one('script:contains(spConfig)').text)
    

    Config is then:

    data['#product_addtocart_form']['configurable']['spConfig']
    

    with keys:

    attributes, template, optionPrices, prices, productId, chooseText, images, index
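Once the config is loaded it is an ordinary dict, so the attribute options can be walked directly. A small sketch, assuming the structure shown in the question; the helper name `options_by_attribute` is made up, and the JSON below is a trimmed copy of the question's value.

```python
import json

# Trimmed copy of the spConfig value from the question.
sp_config = json.loads('''{"attributes":{"93":{"id":"93","code":"color","label":"Color","options":[{"id":"8243","label":"Helloworld","products":["97460","97459"]}],"position":"0"},"148":{"id":"148","code":"codish","label":"Codish","options":[{"id":"4707","label":"12.5","products":[]},{"id":"2724","label":"13","products":[]}],"position":"1"}},"productId":"97468"}''')


def options_by_attribute(config):
    """Map each attribute code to {option label: list of product ids}."""
    result = {}
    for attribute in config.get("attributes", {}).values():
        result[attribute["code"]] = {
            option["label"]: option["products"]
            for option in attribute["options"]
        }
    return result


options = options_by_attribute(sp_config)
print(options["color"])
```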
