I am trying to scrape a list of websites on Shopee. Some examples include dudesgadget and 2ubest. Each of these shops has a different design, builds its web elements differently, and sits on a different domain. They look like standalone websites, but they are actually not.
So the main problem is that I am trying to scrape the product details. I will summarize some of the different structures:
2ubest
<html>
<body>
<div id="shopify-section-announcement-bar" id="shopify-section-announcement-bar">
<main class="wrapper main-content" role="main">
<div class="grid">
<div class="grid__item">
<div id="shopify-section-product-template" class="shopify-section">
<script id="ProductJson-product-template" type="application/json">
//Things I am looking for
</script>
</div>
</div>
</div>
</main>
</div>
</body>
</html>
Another shop (the script sits directly under the body):
<html>
<body id="adjustable-ergonomic-laptop-stand" class="template-product">
<script>
//Things I am looking for
</script>
</body>
</html>
And a few others. I have discovered a pattern between them:
- The thing I am looking for will definitely be inside <body>.
- The thing I am looking for is within a <script> tag.
- The only thing I am not sure about is the distance (nesting depth) from <body> to <script>.
My solution is:
def parse(self, response):
    body = response.xpath("//body")
    # Use a relative path (".//script") so only scripts inside <body> are matched
    for script in body.xpath(".//script/text()").extract():
        # Manipulate the script with js2xml here
        pass
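(As a side note on that last step: for the 2ubest-style markup shown above, the target script is typed application/json, so the ProductJson-product-template block can also be decoded with the standard json module instead of js2xml. A minimal sketch, assuming a Scrapy response object:)

import json

# Sketch only: the id below is taken from the 2ubest example; for that kind of
# script the payload is plain JSON, so json.loads is enough.
product_json = response.xpath('//script[@id="ProductJson-product-template"]/text()').get()
if product_json:
    product = json.loads(product_json)  # a plain Python dict with the product details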
I am able to extract littleplayland, dailysteals and many others, which have very little distance from <body> to <script>, but it does not work for 2ubest, which has a lot of other HTML elements in between before the thing I am looking for. Is there a solution that ignores all of the HTML elements in between and only looks for the <script> tag?
I need a single generic solution that can work across all Shopee websites if possible, since all of them share the characteristics I mentioned above. This means the solution should not filter on <div>, because every website has a different number of <div> elements.
2 Answers
This is how to get the scripts in your HTML using Scrapy:
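(The code block that originally followed is not preserved here; a minimal sketch of the idea, assuming a Scrapy response object: the second // step in //body//script matches <script> elements at any depth below <body>, so the number of wrappers in between does not matter.)

# All <script> text nodes under <body>, no matter how deeply nested:
scripts = response.xpath("//body//script/text()").getall()
for script in scripts:
    print(script[:80])  # peek at the start of each script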
Here is an example with the HTML in your question (I added an extra script 🙂 ):
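(That example is also missing; a self-contained reconstruction built from the question's second HTML snippet, with one extra, deeply nested script added for the demonstration, might look like this:)

from scrapy.selector import Selector

html = """
<html>
  <body id="adjustable-ergonomic-laptop-stand" class="template-product">
    <script>var product = {"title": "Adjustable Ergonomic Laptop Stand"};</script>
    <div class="grid">
      <div class="grid__item">
        <script>var extra = "added for the demo";</script>
      </div>
    </div>
  </body>
</html>
"""

sel = Selector(text=html)
# Both scripts are returned, even though one sits two <div> levels deep.
print(sel.xpath("//body//script/text()").getall())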
Try this:
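(The snippet that followed is missing as well; a hedged sketch of the generic approach the question asks for — go from <body> straight to every <script> and keep only the ones that decode as JSON — with a placeholder spider name and URL:)

import json
import scrapy

class ShopSpider(scrapy.Spider):
    name = "shop"  # placeholder name
    start_urls = ["https://example-shop.test/products/some-product"]  # placeholder URL

    def parse(self, response):
        # Every <script> that is a descendant of <body>, however deep it is nested.
        for script in response.xpath("//body//script/text()").getall():
            text = script.strip()
            try:
                # Shops like 2ubest embed plain JSON (ProductJson-product-template).
                data = json.loads(text)
            except ValueError:
                continue
            if isinstance(data, dict):
                yield data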