
Webscraping Python Website Using JSON Application

Seedizens
January 21, 2023
2 Answers

I am trying to get the price of one item from the website at the URL below, but I am running into problems when I look at the page source.

The url is: https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love

The part of the page source I am interested in is, I think, the following:

<script type="application/ld+json">
    [{

"@context":"http://schema.org",
"@type":"Product",
"productID":"25372685655708131",
"name":"LOVE bracelet, small model",
"description":"#LOVE# bracelet, small model, yellow gold 750/1000. Supplied with a screwdriver. Width: 3.65 mm (for size 17). Now available in a slimmer version, Cartier continues to write the story of the #LOVE# bracelet. Same design, same oval shape, same story: a timeless – yet slightly slimmer – creation which is fastened using a screwdriver. The closure is designed with a functional screw on one side of the bracelet and a hinge on the other. To determine the size of your #LOVE# bracelet, measure your wrist, adding one centimetre to your size for a tighter fit, or two centimetres for a looser fit.",
"image":["https://www.cartier.com/variants/images/25372685655708131/img1/w960.jpg"],
"offers": 
[{"@type":"Offer","availability":"http://schema.org/InStock","priceCurrency":"GBP","price":"4100","sku":"0400574782829","url":"https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html"}]}]
    </script>

I have tried the following steps:

import json
from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool
import pandas as pd

data = {'url':[],'offers_price':[]}

def get_price(url):
    soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
    data = json.loads(soup.find_all('script', {'type': 'application/ld+json'})[-1].get_text())
    return url, int(data['offers']['price'])

if __name__ == '__main__':

    urls = ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love']

    with Pool(processes=4) as pool:
            for url, price in pool.imap_unordered(get_price, urls):
                    data['offers_price'].append(price)
                    data['url'].append(url)
    print(data)

But it was not successful. How would you approach this case?

Answers

I was able to get the price, but I got it from the product-price tag:

import json
from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool
import pandas as pd

data = {'url':[],'offers_price':[]}

def get_price(url):
    soup = BeautifulSoup(requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content, "html.parser")
    data = json.loads(soup.find_all('product-price')[-1]['data-model'])
    return url, int(data['fullPrice'])

if __name__ == '__main__':

    urls = ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love']

    with Pool(processes=4) as pool:
        for url, price in pool.imap_unordered(get_price, urls):
            data['offers_price'].append(price)
            data['url'].append(url)
    print(data)

Output:

{'url': ['https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love'], 'offers_price': [4100]}

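By the way, are you sure you want to append the URL and the price? If you only ever scrape a single item, I think you could assign the values directly instead (with several URLs this would overwrite the earlier results):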

data['offers_price'] = price
data['url'] = url
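
For reference, the JSON-LD route from the question can also be made to work. In the block quoted in the question, the top level is a list wrapping the product object, and "offers" is itself a list, so data['offers']['price'] indexes a list with a string and fails. Below is a minimal sketch, assuming the page serves the block exactly as quoted; the ld_json_text string stands in for the text of the script tag, and extract_offer is a hypothetical helper name, not something from the original code.

```python
import json

# Stand-in for the text of the <script type="application/ld+json"> tag
# quoted in the question (description and image omitted for brevity).
ld_json_text = """
[{
"@context":"http://schema.org",
"@type":"Product",
"productID":"25372685655708131",
"name":"LOVE bracelet, small model",
"offers":
[{"@type":"Offer","availability":"http://schema.org/InStock","priceCurrency":"GBP","price":"4100","sku":"0400574782829","url":"https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html"}]}]
"""


def extract_offer(ld_text):
    """Return (currency, price) from the first offer of the first Product entry, or None."""
    entries = json.loads(ld_text)
    # The top level may be a single object or a list of objects.
    if isinstance(entries, dict):
        entries = [entries]
    for entry in entries:
        if entry.get("@type") != "Product":
            continue
        offers = entry.get("offers", [])
        # "offers" may likewise be one object or a list of objects.
        if isinstance(offers, dict):
            offers = [offers]
        if offers:
            first = offers[0]
            return first["priceCurrency"], int(first["price"])
    return None


result = extract_offer(ld_json_text)
print(result)  # ('GBP', 4100)
```

In get_price, the same helper could be fed the get_text() of each application/ld+json script tag until it returns something other than None, which avoids relying on the block being the last one on the page.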

You can also do it with regular expressions, extracting the necessary information from inline JSON.

To extract data from inline JSON you need to:

open the page source (CTRL + U);
find the data (price, title, etc.) with CTRL + F;
use a regular expression to extract parts of the inline JSON:

# https://regex101.com/r/EPJoTk/1 
portion_of_script = re.findall(r'\[\{"@context":(.*)', str(all_script))

Then we extract the currency and the price:

# https://regex101.com/r/az0sSf/1
currency = re.search(r'"priceCurrency":"(.*?)"', str(portion_of_script)).group(1)

# https://regex101.com/r/ngCxwm/1
price = re.search(r'"price":"(.*?)"', str(portion_of_script)).group(1)
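
The regex steps above can be tried offline before hitting the site. This is only a sketch: page_source is a shortened, single-line stand-in for str(all_script) built from the JSON in the question, and first_group is a hypothetical helper that returns None instead of raising when a pattern does not match. The added \s* is an assumption on my part, to tolerate whitespace or line breaks between the opening brackets and "@context".

```python
import re

# Stand-in for str(all_script): a shortened copy of the inline JSON from the question.
page_source = (
    '<script type="application/ld+json">[{"@context":"http://schema.org",'
    '"@type":"Product","name":"LOVE bracelet, small model",'
    '"offers":[{"@type":"Offer","priceCurrency":"GBP","price":"4100",'
    '"url":"https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html"}]}]'
    '</script>'
)


def first_group(pattern, text):
    """Return the first capture group of pattern in text, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Narrow the search to the part of the source that follows the "@context" key.
portion = first_group(r'\[\{\s*"@context":(.*)', page_source) or ""

# Non-greedy groups stop at the first closing quote after each key.
currency = first_group(r'"priceCurrency":"(.*?)"', portion)
price = first_group(r'"price":"(.*?)"', portion)

print(currency, price, sep="\n")
```

Checking for None before using the values makes a layout change on the site show up as a missing field rather than an AttributeError on .group(1).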

Also, in case it is useful to you, I have an answer to the question about scraping cartier.com with pagination.

Check the code in the online IDE.

from bs4 import BeautifulSoup
import requests, re, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
}
   
page = requests.get("https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html#dept=EU_Love", headers=headers, timeout=30)
soup = BeautifulSoup(page.text, "lxml")
all_script = soup.select("script")

# https://regex101.com/r/EPJoTk/1
portion_of_script = re.findall(r'\[\{"@context":(.*)', str(all_script))

# https://regex101.com/r/az0sSf/1
currency = re.search(r'"priceCurrency":"(.*?)"', str(portion_of_script)).group(1)

# https://regex101.com/r/ngCxwm/1
price = re.search(r'"price":"(.*?)"', str(portion_of_script)).group(1)

url = re.search(r'"url":"(.*?)"', str(portion_of_script)).group(1)

print(currency, price, url, sep="\n")

Output:

GBP
4250
https://www.cartier.com/en-gb/love-bracelet-small-model_cod25372685655708131.html
