
I am trying to use this code for scraping:

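import urllib.request

from bs4 import BeautifulSoup

# Download the raw HTML exactly as the server sends it (no JavaScript is run)
open_url = urllib.request.urlopen('https://en.wikipedia.org/wiki/Guitar')
guitars = open_url.read()

# Parse the static HTML
soup = BeautifulSoup(guitars, 'html.parser')

soup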

The HTML I get back includes the following lines:

<div class="vector-sitenotice-container">
<div id="siteNotice"><!-- CentralNotice --></div>
</div>
<input class="mw-checkbox-hack-checkbox" id="vector-toc-collapsed-checkbox" type="checkbox"/>

But when I inspect the page in the browser, I see the following elements:

[screenshot of the inspected elements]

As you can see, the style element and the <div dir="ltr" id="w112021"> are missing from my result, including the <a></a> element inside it.

Why isn't the code returning these elements?

2 Answers


  1. That’s probably because the HTML content you’re expecting is generated by JavaScript.

    When browsing a website, the server returns some HTML which is initially loaded into your browser. This HTML tree may be modified by the JS in the page. This is a very common practice with many modern web frameworks. Thus, in some cases you cannot simply parse the HTML that the server sends you.

    To read a page as a browser would, you need to render it, which may mean executing its JS. Frameworks like Selenium exist for this reason: they let you control a web browser programmatically and scrape the data you need. See this tutorial for example, and the documentation. (Pro tip: you might need to wait for that element to be present. I recommend using WebDriverWait instead of adding sleep() everywhere.)

    However, keep in mind that while a Selenium bot is several times faster than a human, driving a full browser is still slow compared with plain HTTP requests. Depending on your needs, you might be able to take a faster and easier approach, such as fetching the same data that this JS code fetches. To find the website’s API calls, open the ‘Network’ tab of your browser’s inspector and see which requests are performed.
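
A minimal sketch of the Selenium approach with WebDriverWait, assuming Selenium 4 and a local Chrome installation. The function name `fetch_rendered_html` and the selector `#siteNotice a` are illustrative assumptions, not taken from the question; replace the selector with one matching the element you are actually missing.

```python
def fetch_rendered_html(url, css_selector, timeout=10):
    """Load url in headless Chrome and return the DOM once css_selector exists."""
    # Imported here so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the element is present in the DOM; raises
        # selenium.common.exceptions.TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        # page_source reflects the DOM after JavaScript has modified it.
        return driver.page_source
    finally:
        driver.quit()


if __name__ == "__main__":
    from bs4 import BeautifulSoup

    # Hypothetical selector: a link inside the site notice banner.
    html = fetch_rendered_html("https://en.wikipedia.org/wiki/Guitar", "#siteNotice a")
    soup = BeautifulSoup(html, "html.parser")
    print(soup.find(id="siteNotice"))
```

If the element never appears (for example, because no banner is being shown at that moment), the wait raises a TimeoutException instead of returning incomplete HTML, which is usually easier to debug than a fixed sleep().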

  2. I’ve found Selenium helpful in the past. I’ve recently been working with Playwright though and I’ve found it significantly faster. This may or may not matter in your particular case.

    There are versions for Python, Node.js, Java, and .NET; see the Playwright docs.

    I’ve been using it in Python and, aside from some confusion over CSS selectors, it’s been relatively simple to learn. Oxylabs has some helpful guides on it.
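
For comparison, here is a rough Playwright equivalent using its synchronous Python API. It assumes the package and a browser have been installed (`pip install playwright`, then `playwright install chromium`); the function name `fetch_with_playwright` and the selector in the usage example are illustrative assumptions.

```python
def fetch_with_playwright(url, css_selector, timeout_ms=10_000):
    """Return the rendered HTML of url once css_selector is attached to the DOM."""
    # Imported here so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # Wait until the element exists in the DOM (it need not be visible).
            page.wait_for_selector(css_selector, state="attached", timeout=timeout_ms)
            # content() returns the full HTML of the page as currently rendered.
            return page.content()
        finally:
            browser.close()


if __name__ == "__main__":
    # Hypothetical selector; replace it with the element you are missing.
    html = fetch_with_playwright("https://en.wikipedia.org/wiki/Guitar", "#siteNotice a")
    print(len(html))
```

The returned string can be passed to BeautifulSoup exactly like the output of urlopen().read() in the question, so the rest of the parsing code does not need to change.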
