
I am trying to scrape a list of websites on Shopee. Some examples include dudesgadget and 2ubest. Each of these Shopee shops has a different design, a different way of constructing its web elements, and a different domain as well. They look like standalone websites, but they actually are not.

The main problem is that I am trying to scrape the product details. Here is a summary of some of the different structures:

2ubest

<html>
    <body>
        <div id="shopify-section-announcement-bar">
            <main class="wrapper main-content" role="main">
                <div class="grid">
                    <div class="grid__item">
                        <div id="shopify-section-product-template" class="shopify-section">
                            <script id="ProductJson-product-template" type="application/json">
                                //Things I am looking for
                            </script>
                        </div>
                    </div>
                </div>
            </main>
        </div>
    </body>
</html>

littleplayland

<html>
    <body id="adjustable-ergonomic-laptop-stand" class="template-product">
        <script>
            //Things I am looking for
        </script>
    </body>
</html>

There are a few others, and I have discovered a pattern between them.

  1. The thing that I am looking for is always inside <body>
  2. The thing that I am looking for is within a <script>
  3. The only thing that I am not sure about is the distance (nesting depth) from <body> to <script>

My solution is:

def parse(self, response):
    body = response.xpath("//body")
    # ".//" keeps the search relative to <body>; a bare "//" searches the whole document
    for script in body.xpath(".//script/text()").extract():
        # Manipulate the script with js2xml here
        pass

I am able to extract the data for littleplayland, dailysteals and many others where the distance from <body> to <script> is small, but it does not work for 2ubest, which has a lot of other HTML elements between <body> and the thing I am looking for. Is there a solution that ignores all the HTML elements in between and only looks for the <script> tag?

I need a single, generic solution that works across all Shopee websites if possible, since all of them share the characteristics I mentioned above.

This means the solution should not filter on <div>, because every website has a different number of <div> elements.

2 Answers


  1. This is how to get the scripts in your HTML using Scrapy:

    scriptTagSelector = scrapy.Selector(text=text, type="html")
    theScripts = scriptTagSelector.xpath("//script/text()").extract()
    
    for script in theScripts:
        #Manipulate the script with js2xml here
        print("------->A SCRIPT STARTS HERE<--------")
        print(script)
        print("------->A SCRIPT ENDS HERE<--------")
    

    Here is an example with the HTML in your question (I added an extra script 🙂 ):

    import scrapy
    
    text="""<html>
        <body>
            <div id="shopify-section-announcement-bar">
                <main class="wrapper main-content" role="main">
                    <div class="grid">
                        <div class="grid__item">
                            <div id="shopify-section-product-template" class="shopify-section">
                                <script id="ProductJson-product-template" type="application/json">
                                    //Things I am looking for
                                </script>
                            </div>
                            <script id="script 2">I am another script</script>
                        </div>
                    </div>
                </main>
            </div>
        </body>
    </html>"""
    
    scriptTagSelector = scrapy.Selector(text=text, type="html")
    theScripts = scriptTagSelector.xpath("//script/text()").extract()
    
    for script in theScripts:
        #Manipulate the script with js2xml here
        print("------->A SCRIPT STARTS HERE<--------")
        print(script)
        print("------->A SCRIPT ENDS HERE<--------")
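A possible follow-up, offered as a sketch rather than a definitive fix: the 2ubest script in the question is declared as `type="application/json"`, so it may be simpler to parse it with `json.loads` than to treat it as JavaScript. The example below uses only the standard library's `html.parser` in place of Scrapy so that it runs without extra dependencies. The class `ScriptCollector`, the helper `json_scripts`, and the product fields in `SAMPLE` are hypothetical and are not taken from any real shop page.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect (attributes, text) for every <script>, whatever its depth."""

    def __init__(self):
        super().__init__()
        self.scripts = []    # list of (attrs dict, text) pairs
        self._attrs = None   # attrs of the <script> currently open, if any
        self._chunks = []    # text pieces seen inside the open <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._attrs is not None:
            self.scripts.append((self._attrs, "".join(self._chunks)))
            self._attrs = None


def json_scripts(html):
    """Return the decoded payload of every application/json <script>."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return [
        json.loads(text)
        for attrs, text in parser.scripts
        if attrs.get("type") == "application/json"
    ]


# Hypothetical page: the JSON script is nested several levels deep and sits
# next to an unrelated JavaScript <script>.
SAMPLE = """<html><body><div><main><div>
<script id="ProductJson-product-template" type="application/json">
  {"title": "Laptop stand", "price": 1990}
</script>
<script>var unrelated = true;</script>
</div></main></div></body></html>"""

print(json_scripts(SAMPLE))
```

Inside a Scrapy callback, the same filter could presumably be written as the XPath `//script[@type="application/json"]/text()`, with each result passed to `json.loads`.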
    
  2. Try this:

    //body//script/text()
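To illustrate why the nesting depth should not matter for this expression: `//` is the descendant axis, so `//body//script` matches a `<script>` at any level below `<body>`. The sketch below checks this on two hypothetical, well-formed snippets that mirror the deep and shallow layouts from the question. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, as a stand-in for Scrapy's selectors; the helper `script_texts` and both sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: one with the script buried in nested elements,
# one with the script directly under <body>.
DEEP = """<html><body><div><main><div><div>
<script id="ProductJson-product-template" type="application/json">{"title": "deep"}</script>
</div></div></main></div></body></html>"""

SHALLOW = """<html><body><script>var product = 1;</script></body></html>"""


def script_texts(markup):
    """Return the text of every <script> below <body>, at any depth."""
    root = ET.fromstring(markup)   # root is the <html> element
    body = root.find("body")       # direct child of <html>
    # ".//script" selects every descendant of <body> named "script"
    return [script.text for script in body.findall(".//script")]


print(script_texts(DEEP))
print(script_texts(SHALLOW))
```

Both calls find the script regardless of how many elements sit between it and `<body>`.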
    