I am trying to scrape a table from a page. This seems like a trivial task, but for some reason I cannot scrape this one particular table.
The table I am trying to access is located here:
#price-history-chart > div > div:nth-child(1) > div > div > table
import requests
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
URL = 'https://www.dekudeals.com/items/buy-the-game-i-have-a-gun-sheesh-man-digital-deluxe-mega-chad-edition?format=digital'
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')
# getting pricing tables
tables = soup.find_all('table')
The code above gives me:
[<table class="table table-align-middle item-price-table">
<tr class="">
<td>
<a href="https://store.playstation.com/en-pl/product/EP7603-CUSA41273_00-0623688910030618" rel="nofollow noopener" style="padding: 0 0.75rem;" target="_blank">
<div class="logo playstation">
<img alt="PlayStation Store" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"/>
</div>
</a>
</td>
<td class="version">
<a class="text-dark" href="https://store.playstation.com/en-pl/product/EP7603-CUSA41273_00-0623688910030618" rel="nofollow noopener" target="_blank">
<span class="text-muted">PS4</span>
<br/>
Digital
</a>
</td>
<td>
<a href="https://store.playstation.com/en-pl/product/EP7603-CUSA41273_00-0623688910030618" rel="nofollow noopener" target="_blank">
<div class="btn btn-block btn-primary">
54,00 zł
</div>
</a>
</td>
</tr>
</table>, <table>
<tr>
<td class="text-center" colspan="2" style="border-bottom: 1px solid #cccccc;">
<strong>All time low</strong>
</td>
</tr>
<tr>
<td class="text-right pl-3">
54,00 zł
</td>
<td>
</td>
</tr>
</table>]
When I try to target it with a CSS selector:
tables = soup.select_one('#price-history-chart > div > div:nth-child(1) > div > div > table')
the result is None.
The only option that works uses Selenium, and it is not clear to me; I would like to understand what I am coding.
import pandas as pd
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

URL = 'https://www.dekudeals.com/items/buy-the-game-i-have-a-gun-sheesh-man-digital-deluxe-mega-chad-edition?format=digital'

# creating a new Chrome session
service = Service(executable_path=".../Driver/chromedriver")
driver = webdriver.Chrome(service=service)

# loading the webpage
driver.get(URL)
driver.implicitly_wait(5)
wait = WebDriverWait(driver, 25)
soup = BeautifulSoup(driver.page_source, 'lxml')

# finding all tables; the pricing history table is the third one
tables = soup.find_all('table')
pricing_history_list = pd.read_html(str(tables[2]))
Can someone give me a solution and, most importantly, the reason why I cannot target the table?
2 Answers
Maybe you can try something like this:
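This is a minimal sketch that assumes the price history is embedded as JSON in the script tag with id price_history_data; the exact structure of the parsed JSON may differ:

import json
import requests
from bs4 import BeautifulSoup

URL = 'https://www.dekudeals.com/items/buy-the-game-i-have-a-gun-sheesh-man-digital-deluxe-mega-chad-edition?format=digital'
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')

# the price history ships with the static HTML as JSON,
# inside <script id="price_history_data" type="application/json">
script_tag = soup.find('script', id='price_history_data')
json_data = json.loads(script_tag.string)
print(json_data)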
Here, json_data contains the data behind the table you're targeting.
I have had similar problems and want to quickly expand on the other answers. BeautifulSoup works well for static HTML pages, i.e. the HTML source code you can usually see by pressing Ctrl+U or similar. As mentioned by another commenter, some websites use JavaScript or JSON to dynamically load extra content, which will not be visible in the static HTML source but can be picked up by Selenium.
If you look at the source code of your URL (with Ctrl+U or similar, not in the developer tools of some browsers, which can catch dynamically loaded elements that BeautifulSoup will not have access to), you can see that the only two <table>...</table> elements that exist do not actually contain the data that I suppose you are looking for. The data is instead stored inside <script id='price_history_data' type='application/json'>, which Ajeet Verma targets in his solution. Using that JSON, the website then builds the table you were trying to target, so the element only becomes a table after the page has loaded, which is too late for BeautifulSoup but can be picked up by Selenium. But because the data you (probably) want is already there outside of a <table>, you do not actually need Selenium, which would take longer than BeautifulSoup.
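To see this for yourself from the static HTML, a quick check along these lines (same URL and selector as in the question) should show the chart's table missing while the JSON payload is present:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.dekudeals.com/items/buy-the-game-i-have-a-gun-sheesh-man-digital-deluxe-mega-chad-edition?format=digital'
soup = BeautifulSoup(requests.get(URL).text, 'html.parser')

# the chart's <table> is not in the static HTML, so this prints None
print(soup.select_one('#price-history-chart table'))

# but the JSON that the page later turns into that table is there
print(soup.find('script', id='price_history_data') is not None)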