
I want to extract information from this website using Scrapy. The information I need is in a JSON block embedded in the page, but that JSON contains unwanted literal newline characters, and only in the description field.

Here is the JSON element I want to scrape, taken from an example page:

<script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Product",
            "description": "Hamster ve Guinea Pig için tasarlanmış temizliği kolay mama kabıdır.

Hamster motifleriyle süslü ve son derece sevimlidir.

Ürün seramikten yapılmıştır 

Ürün ölçüleri 


    Hacim: 100 ml
    Çap: 8 cm",
      "name": "Karlie Seramik Hamster ve Guinea Pigler İçin Yemlik 100ml 8cm",
      "image": "https://www.petlebi.com/up/ecommerce/product/lg_karlie-hamster-mama-kaplari-359657192.jpg",
      "brand": {
        "@type": "Brand",
        "name": "Karlie"
      },
      "category": "Guinea Pig Yemlikleri",
      "sku": "4016598440834",
      "gtin13": "4016598440834",
      "offers": {
        "@type": "Offer",
         "availability": "http://schema.org/InStock",
         "price": "149.00",
        "priceCurrency": "TRY",
        "itemCondition": "http://schema.org/NewCondition",
        "url": "https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html"
      },
      "review": [
            ]
    }
    </script>

As you can see there are literal newline characters in the description, which are not allowed in JSON. Here is the code I was trying but it didn’t work:

import scrapy
import json
import re

class JsonSpider(scrapy.Spider):
    name = 'json_spider'
    start_urls = ['https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html']

    def parse(self, response):
        # Extract the script content containing the JSON data
        script_content = response.xpath('/html/body/script[12]').get()

        if not script_content:
            self.logger.warning("Script content not found.")
            return

        # The "+" must be escaped, otherwise it is read as a regex quantifier
        json_data_match = re.search(r'<script type="application/ld\+json">(.*?)</script>', script_content, re.DOTALL)
        if json_data_match:
            json_data_str = json_data_match.group(1)
            try:
                json_obj = json.loads(json_data_str)

                product_info = {
                    "name": json_obj.get("name"),
                    "description": json_obj.get("description"),
                    "image": json_obj.get("image"),
                    "brand": json_obj.get("brand", {}).get("name"),
                    "category": json_obj.get("category"),
                    "sku": json_obj.get("sku"),
                    "price": json_obj.get("offers", {}).get("price"),
                    "url": json_obj.get("offers", {}).get("url")
                }

                self.logger.info("Extracted Product Information: %s", product_info)

                with open('product_info.json', 'w', encoding='utf-8') as json_file:
                    json.dump(product_info, json_file, ensure_ascii=False, indent=2)

            except json.JSONDecodeError as e:
                self.logger.error("Error decoding JSON: %s", e)

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html',
            callback=self.parse,
        )

I want the code to be generic so that it works for every product.

I used https://jsonlint.com/ to find the unwanted characters, and when I delete the literal newlines in the description it reports the JSON as valid. I tried html.unescape but it didn’t work. The code fails on this line:
json_obj = json.loads(json_data_str)
How can I fix this?

2 Answers


  1. Remove the offending characters from the response text before converting it into a JSON object, like this:

    json_data_str = json_data_str.replace("\n","").replace("\r","").replace("\t","")
    

    Or you can pass the strict parameter to the json.loads function:

    json.loads(json_data_str,strict=False)
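
A minimal sketch of the strict=False approach, applied to a trimmed-down copy of the script block from the question. The helper name load_ld_json and the shortened sample data are assumptions made for illustration, and the pattern assumes the script tag is written exactly as in the question.

```python
import json
import re

# Trimmed-down copy of the page's script block; the description contains
# a literal newline character, which strict JSON parsing rejects.
SAMPLE_HTML = '''<script type="application/ld+json">
{
  "name": "Karlie Seramik Yemlik",
  "description": "Hacim: 100 ml
Çap: 8 cm"
}
</script>'''

# The "+" in "ld+json" is escaped so it is matched literally.
LD_JSON_RE = re.compile(
    r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
)


def load_ld_json(html):
    """Return the first JSON-LD object found in html, or None if there is none."""
    match = LD_JSON_RE.search(html)
    if match is None:
        return None
    # strict=False allows control characters (newline, tab, ...) inside strings.
    return json.loads(match.group(1), strict=False)


product = load_ld_json(SAMPLE_HTML)
print(repr(product["description"]))
```

With strict=False the newlines are kept in the parsed description, so nothing has to be stripped or rewritten beforehand.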
    
  2. Replacing newlines only within the "description": value is a little bit more involved than I’d like, but try this.

                    json_data_str_fixed = re.sub(
                        r'"description": "[^"]*(\n[^"]*)*"',
                        lambda x: re.sub(r"\n", r"\\n", x.group(0)),
                        json_data_str)
                    json_obj = json.loads(json_data_str_fixed)
    

    In so many words, the outer re.sub selects the "description": key and its value, including any newlines, and replaces it with the same string in which the inner re.sub has turned each literal newline into an escaped newline.

    If you don’t want to preserve the newlines at all, of course, this is much simpler; just

                    json_obj = json.loads(json_data_str.replace("\n", ""))
    

    but understand that this will turn e.g. yapılmıştır(newline)(newline)Ürün into yapılmıştırÜrün which probably isn’t what you want.

    Using json.loads(..., strict=False) as suggested in the other answer is probably easier in your scenario; but I wanted to provide an answer which can be adapted to scenarios where this doesn’t work. (I would upvote, and suggest you accept, the other answer if it didn’t suggest munging the text as its primary solution.)

    Demo: https://ideone.com/okXYok
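
The description-only replacement can be sketched as a small standalone function. The name escape_description_newlines and the sample string are hypothetical, and the pattern is a simplified variant of the one in this answer: a negated character class such as [^"] already matches newlines, so the extra group is left out. It assumes the description contains no escaped double quotes, because the match stops at the first quote it meets.

```python
import json
import re

# Matches the "description" key and its value up to the next double quote.
# [^"] also matches newline characters, so multi-line values are covered.
DESCRIPTION_RE = re.compile(r'"description": "[^"]*"')


def escape_description_newlines(raw):
    """Turn literal newlines inside the description value into JSON escapes."""
    return DESCRIPTION_RE.sub(
        # The return value of a function replacement is inserted as-is,
        # so "\\n" ends up as a backslash followed by the letter n.
        lambda m: m.group(0).replace("\n", "\\n"),
        raw,
    )


# Hypothetical input: the description holds two literal newline characters.
raw = '{"description": "line one\n\nline two", "name": "x"}'

fixed = escape_description_newlines(raw)
product = json.loads(fixed)  # default strict parsing now succeeds
print(repr(product["description"]))
```

Unlike stripping every newline, this keeps the paragraph breaks in the parsed description and leaves the rest of the JSON text untouched.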
