I want to extract information from this website using Scrapy. The information I need is in a JSON block inside a script element, and this JSON has unwanted literal newline characters, but only in the description field.
Here is the JSON element I want to scrape from an example page:
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Product",
"description": "Hamster ve Guinea Pig için tasarlanmış temizliği kolay mama kabıdır.
Hamster motifleriyle süslü ve son derece sevimlidir.
Ürün seramikten yapılmıştır
Ürün ölçüleri
Hacim: 100 ml
Çap: 8 cm",
"name": "Karlie Seramik Hamster ve Guinea Pigler İçin Yemlik 100ml 8cm",
"image": "https://www.petlebi.com/up/ecommerce/product/lg_karlie-hamster-mama-kaplari-359657192.jpg",
"brand": {
"@type": "Brand",
"name": "Karlie"
},
"category": "Guinea Pig Yemlikleri",
"sku": "4016598440834",
"gtin13": "4016598440834",
"offers": {
"@type": "Offer",
"availability": "http://schema.org/InStock",
"price": "149.00",
"priceCurrency": "TRY",
"itemCondition": "http://schema.org/NewCondition",
"url": "https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html"
},
"review": [
]
}
</script>
As you can see there are literal newline characters in the description, which are not allowed in JSON. Here is the code I was trying but it didn’t work:
import scrapy
import json
import re

class JsonSpider(scrapy.Spider):
    name = 'json_spider'
    start_urls = ['https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html']

    def parse(self, response):
        # Extract the script content containing the JSON data
        script_content = response.xpath('/html/body/script[12]').get()
        if not script_content:
            self.logger.warning("Script content not found.")
            return
        json_data_match = re.search(r'<script type="application/ld+json">(.*?)</script>', script_content, re.DOTALL)
        if json_data_match:
            json_data_str = json_data_match.group(1)
            try:
                json_obj = json.loads(json_data_str)
                product_info = {
                    "name": json_obj.get("name"),
                    "description": json_obj.get("description"),
                    "image": json_obj.get("image"),
                    "brand": json_obj.get("brand", {}).get("name"),
                    "category": json_obj.get("category"),
                    "sku": json_obj.get("sku"),
                    "price": json_obj.get("offers", {}).get("price"),
                    "url": json_obj.get("offers", {}).get("url")
                }
                self.logger.info("Extracted Product Information: %s", product_info)
                with open('product_info.json', 'w', encoding='utf-8') as json_file:
                    json.dump(product_info, json_file, ensure_ascii=False, indent=2)
            except json.JSONDecodeError as e:
                self.logger.error("Error decoding JSON: %s", e)

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html',
            callback=self.parse,
        )
I want this code to be dynamic so that it works for every product.
I used https://jsonlint.com/ to see the unwanted characters, and when I delete those characters from the description it says the JSON is valid. I tried html.unescape, but it didn’t work. The code fails at this line:
json_obj = json.loads(json_data_str)
How can I do it?
2 Answers
Just remove the specific character from the response text before converting it into a JSON object. Or you can pass the strict parameter to the json.loads function, as in json.loads(..., strict=False).

Replacing newlines only within the "description": value is a little bit more involved than I’d like, but it can be done with two nested re.sub calls. In so many words, the outer re.sub selects the "description": key and value, including any newlines, and replaces it with … the same string with the newlines replaced with escaped newlines by the inner re.sub.

If you don’t want to preserve the newlines at all, of course, this is much simpler; just remove every newline from the text. But understand that this will turn e.g. yapılmıştır(newline)(newline)Ürün into yapılmıştırÜrün, which probably isn’t what you want.

Using json.loads(..., strict=False) as suggested in the other answer is probably easier in your scenario, but I wanted to provide an answer which can be adapted to scenarios where this doesn’t work. (I would upvote, and suggest you accept, the other answer if it didn’t suggest munging the text as its primary solution.)

Demo: https://ideone.com/okXYok
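
A minimal sketch of the two approaches discussed in the answers, offered as an illustration rather than either answerer's original code. The sample text RAW, the function names parse_lenient and escape_description_newlines, and the exact regular expression are all assumptions made for this example; the real page's JSON may need a different pattern.

```python
import json
import re

# Hypothetical sample that mimics the page's JSON: the "description"
# value contains raw (unescaped) newlines, which strict JSON forbids.
RAW = '''{
"@type": "Product",
"description": "Line one.
Line two.

Line three",
"name": "Example product"
}'''


def parse_lenient(text):
    """Approach 1: strict=False makes json.loads accept control
    characters, including raw newlines, inside string values."""
    return json.loads(text, strict=False)


def escape_description_newlines(text):
    """Approach 2: nested re.sub. The outer call selects the
    "description" key and its string value; the inner call rewrites
    each raw newline in that match as a backslash followed by n."""
    # Assumed pattern: a JSON string is any run of non-quote,
    # non-backslash characters or backslash-escaped characters.
    pattern = r'"description"\s*:\s*"(?:[^"\\]|\\.)*"'
    return re.sub(
        pattern,
        lambda m: re.sub(r'\r?\n', r'\\n', m.group(0)),
        text,
        flags=re.DOTALL,
    )


lenient = parse_lenient(RAW)
strict = json.loads(escape_description_newlines(RAW))
print(lenient["description"] == strict["description"])
```

In the spider, the raw text could probably be obtained with something like response.xpath('//script[@type="application/ld+json"]/text()').get(), assuming the page has a single such element, and then passed to either function. That would also avoid depending on the hard-coded script[12] position.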