
I am trying to get the price of one item from the website at the URL below. However, I am running into issues when parsing the page source.

The url is: https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love

The part of the source page I am interested in is the following (I guess):

<script type="application/ld+json">
    [{

"@context":"http://schema.org",
"@type":"Product",
"productID":"25372685655708131",
"name":"LOVE bracelet, small model",
"description":"#LOVE# bracelet, small model, yellow gold 750/1000. Supplied with a screwdriver. Width: 3.65 mm (for size 17). Now available in a slimmer version, Cartier continues to write the story of the #LOVE# bracelet. Same design, same oval shape, same story: a timeless – yet slightly slimmer – creation which is fastened using a screwdriver. The closure is designed with a functional screw on one side of the bracelet and a hinge on the other. To determine the size of your #LOVE# bracelet, measure your wrist, adding one centimetre to your size for a tighter fit, or two centimetres for a looser fit.",
"image":["https://www.cartier.com/variants/images/25372685655708131/img1/w960.jpg"],
"offers": 
[{"@type":"Offer","availability":"http://schema.org/InStock","priceCurrency":"GBP","price":"4100","sku":"0400574782829","url":"https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html"}]}]
    </script>

I have tried the following steps:

import json
from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool
import pandas as pd

data = {'url':[],'offers_price':[]}

def get_price(url):
    soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
    data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
    return url, int(data['offers']['price'])

if __name__ == '__main__':

    urls = ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love']

    with Pool(processes=4) as pool:
        for url, price in pool.imap_unordered(get_price, urls):
            data['offers_price'].append(price)
            data['url'].append(url)
    print(data)

But this was not successful. How would you approach this case?

2 Answers


  1. I was able to get the price, but I got it from the product-price tag:

    import json
    from bs4 import BeautifulSoup
    import requests
    from multiprocessing import Pool
    import pandas as pd
    
    data = {'url':[],'offers_price':[]}
    
    def get_price(url):
        soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
        data = json.loads(soup.find_all('product-price')[-1]['data-model'])
        return url, int(data['fullPrice'])
    
    if __name__ == '__main__':
    
        urls = ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love']
    
        with Pool(processes=4) as pool:
            for url, price in pool.imap_unordered(get_price, urls):
                data['offers_price'].append(price)
                data['url'].append(url)
        print(data)
    

    Output:

    {'url': ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love'], 'offers_price': [4100]}
    

    By the way, are you sure you want to append the url and the price? I think you should do this instead:

    data['offers_price'] = price
    data['url'] = url
    
  2. You can also do it with regular expressions, extracting the necessary information from inline JSON.

    To extract data from inline JSON, you need to:

    1. open the page source (CTRL + U);
    2. find the data (price, title, etc.) with CTRL + F;
    3. use a regular expression to extract parts of the inline JSON:
    # https://regex101.com/r/EPJoTk/1 
    portion_of_script = re.findall(r'\[\{"@context":(.*)', str(all_script))
    

    After that, we extract the currency and the price:

    # https://regex101.com/r/az0sSf/1
    currency = re.search(r'"priceCurrency":"(.*?)"', str(portion_of_script)).group(1)
    
    # https://regex101.com/r/ngCxwm/1
    price = re.search(r'"price":"(.*?)"', str(portion_of_script)).group(1)
    

    Also, if it will be useful for you, I have an answer to the question about scraping cartier.com with pagination.

    Check code in the online IDE.

    from bs4 import BeautifulSoup
    import requests, re, lxml
    
    # https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    }
       
    page = requests.get("https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love", headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, "lxml")
    all_script = soup.select("script")
    
    # https://regex101.com/r/EPJoTk/1 
    # Grab everything after the opening of the JSON-LD block.
    portion_of_script = re.findall(r'\[\{"@context":(.*)', str(all_script))
    
    # https://regex101.com/r/az0sSf/1
    currency = re.search(r'"priceCurrency":"(.*?)"', str(portion_of_script)).group(1)
    
    # https://regex101.com/r/ngCxwm/1
    price = re.search(r'"price":"(.*?)"', str(portion_of_script)).group(1)
    
    url = re.search(r'"url":"(.*?)"', str(portion_of_script)).group(1)
    
    print(currency, price, url, sep="\n")
    

    Output:

    GBP
    4250
    https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html
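
The patterns above can be checked offline before pointing them at the live page. The sketch below runs them against a static excerpt of the JSON-LD from the question, so no request is made. The helper `first_group` is hypothetical and not part of the answer; it guards against `re.search` returning `None`, in which case calling `.group(1)` directly raises `AttributeError`. The `priceValidUntil` key is only an example of a field absent from the excerpt.

```python
import re

# Static excerpt of the JSON-LD from the question (no network request).
sample = (
    '[{"@context":"http://schema.org","@type":"Product",'
    '"offers":[{"@type":"Offer","priceCurrency":"GBP","price":"4100",'
    '"url":"https://www.cartier.com/en-gb/'
    'love-bracelet-small-model_cod25372685655708131.html"}]}]'
)


def first_group(pattern, text):
    """Return the first capture group of pattern in text, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


currency = first_group(r'"priceCurrency":"(.*?)"', sample)
# '"price":' does not match inside '"priceCurrency":', so this finds the amount.
price = first_group(r'"price":"(.*?)"', sample)
url = first_group(r'"url":"(.*?)"', sample)
# A key that is absent from the excerpt yields None instead of an exception.
missing = first_group(r'"priceValidUntil":"(.*?)"', sample)

print(currency, price, url, missing, sep="\n")
```

Returning `None` for a missing field makes it easier to skip pages whose markup differs, rather than stopping the whole run on the first mismatch.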
    