
So I have this text that I extracted out of a <script> tag.

    function fbq_w123456as() {
        fbq('track', 'AddToCart', {
            contents: [
                {
                    'id': '123456',
                    'quantity': '',
                    'item_price': 69.99
                }
            ],
            content_name: 'Stackoverflow',
            content_category: '',
            content_ids: ['w123456as'],
            content_type: 'product',
            value: 420.69,
            currency: 'USD'
        });
    }

I’m trying to extract this information using regex and then convert it into JSON using Python.
I’ve tried re.search with the pattern r"'AddToCart', (.*?);" and a few other attempts, but no luck. I am very new to regex and am struggling with it. This is the output I want:

{
    "contents":[
        {
            "id":"123456",
            "quantity":"",
            "item_price":69.99
        }
    ],
    "content_name":"Stackoverflow",
    "content_category":"",
    "content_ids":[
        "w123456as"
    ],
    "content_type":"product",
    "value":420.69,
    "currency":"USD"
}

How would I create the regex to extract the JSON data?

2 Answers


  1. Extracting data embedded in a script with regular expressions can be tricky, but for the given example it is workable. Here’s one way to do it in Python:

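    import re
    import json

    html_script = '''
    function fbq_w123456as() {
        fbq('track', 'AddToCart', {
            contents: [
                {
                    'id': '123456',
                    'quantity': '',
                    'item_price': 69.99
                }
            ],
            content_name: 'Stackoverflow',
            content_category: '',
            content_ids: ['w123456as'],
            content_type: 'product',
            value: 420.69,
            currency: 'USD'
        });
    }
    '''

    # Capture the object literal passed to fbq('track', 'AddToCart', ...).
    # Literal parentheses and braces are escaped; [\s\S]*? lazily matches any
    # character, including newlines, up to the first "}" followed by ");".
    match = re.search(r"fbq\('track', 'AddToCart', (\{[\s\S]*?\})\);", html_script)

    if match:
        js_object = match.group(1)
        # The capture is a JavaScript object literal, not JSON: quote the bare
        # keys, then turn single quotes into double quotes.
        json_data = re.sub(r"(\w+)\s*:", r'"\1":', js_object).replace("'", '"')
        # Convert the JSON text to a Python dictionary
        data_dict = json.loads(json_data)
        print(data_dict)
    else:
        print("No match found.")


    This code uses the re.search function to find the object inside the html_script string. The regular expression pattern fbq\('track', 'AddToCart', (\{[\s\S]*?\})\); looks for the literal text fbq('track', 'AddToCart', and then captures everything from the opening curly brace up to the closing brace that is followed by );. The parentheses and braces that should match literally are escaped with a backslash, and [\s\S] matches any character, including newlines.

    If a match is found, the captured text is extracted using match.group(1). It is a JavaScript object literal with unquoted keys and single-quoted strings, which json.loads would reject, so the bare keys are quoted and the single quotes replaced with double quotes first. This simple cleanup assumes that no string value contains a colon or a quote character. Then json.loads is used to convert the JSON text into a Python dictionary (data_dict).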

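    Keep in mind that using regular expressions to parse nested data structures like this is generally not recommended, because it is error-prone: a lazy match can stop at the wrong closing brace, for example. It’s usually better to use a dedicated parser, such as an HTML parser to locate the <script> content and a proper parser for the data itself, for more reliable and maintainable code.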
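
To make that caveat concrete, here is a minimal sketch of one alternative to a lazy regex: walk the text and track brace depth, ignoring braces inside quoted strings. The helper name extract_object_literal and the sample string js are hypothetical and only for illustration; the sketch assumes the script contains no comments or template literals that would confuse the scan.

```python
def extract_object_literal(source: str, marker: str) -> str:
    """Return the first {...} block after `marker`, matched by brace depth."""
    # Find the first opening brace that follows the marker text.
    start = source.index("{", source.index(marker))
    depth = 0
    quote = None  # quote character of the string we are inside, if any
    i = start
    while i < len(source):
        ch = source[i]
        if quote:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == quote:
                quote = None  # string literal ends
        elif ch in "'\"":
            quote = ch  # string literal starts
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return source[start : i + 1]
        i += 1
    raise ValueError("unbalanced braces after marker")


# A nested object with a "}" inside a string, followed by another call.
js = "fbq('track', 'AddToCart', {a: {b: '}'}, c: 1}); other({x: 2});"
print(extract_object_literal(js, "'AddToCart'"))
# prints: {a: {b: '}'}, c: 1}
```

The returned text is still a JavaScript object literal, so it needs the same key and quote cleanup before it can be loaded as JSON.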

  2. You can try:

    import re
    from ast import literal_eval
    
    js_txt = """
        function fbq_w123456as() {
                fbq('track', 'AddToCart', {
                contents: [
                {
                        'id': '123456',
                        'quantity': '',
                        'item_price':69.99                                                        }
                ],
                content_name: 'Stackoverflow',
                content_category: '',
                content_ids: ['w123456as'],
                content_type: 'product',
                value: 420.69,
                currency: 'USD'
                });
        }"""
    
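    # Grab the object literal; the closing ")" is escaped and re.S lets "." span newlines
    out = re.search(r"'AddToCart', ({.*?})\);", js_txt, flags=re.S).group(1)
    # Wrap each bare key in double quotes, using the backreference \1
    out = re.sub(r"""([^"'\s]+):""", r'"\1":', out)
    # The result is a valid Python literal, so literal_eval can parse it safely
    out = literal_eval(out)
    print(out)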
    

    Prints a Python dict (shown here reformatted for readability):

    {
        "contents": [{"id": "123456", "quantity": "", "item_price": 69.99}],
        "content_name": "Stackoverflow",
        "content_category": "",
        "content_ids": ["w123456as"],
        "content_type": "product",
        "value": 420.69,
        "currency": "USD",
    }
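
Since the question asks for JSON rather than a Python dict, one more step is probably wanted. A minimal sketch, assuming out holds the dict shown above (it is written out by hand here so the snippet runs on its own):

```python
import json

# The dict from the output above, repeated so this snippet is self-contained.
out = {
    "contents": [{"id": "123456", "quantity": "", "item_price": 69.99}],
    "content_name": "Stackoverflow",
    "content_category": "",
    "content_ids": ["w123456as"],
    "content_type": "product",
    "value": 420.69,
    "currency": "USD",
}

# Serialize the dict to a JSON string; indent=4 pretty-prints it.
json_text = json.dumps(out, indent=4)
print(json_text)
```

json.dumps keeps the keys in insertion order, so the printed JSON follows the same key order as the original object.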
    