I am trying to set up automatic web scraping of a website using Python, storing the HTML data and building a JSON file in a specific format. I already have the JSON file template and have been able to get the HTML data as text using BeautifulSoup. However, I cannot figure out how to select specific parts of the data without typing the values into the code myself. Is there something I can do, or would I have to enter all of that data by hand?
Thanks. Below is the code I am using.
import requests
from bs4 import BeautifulSoup

# need to automate page swapping, but for now just test one page
# need to increment over tr class-2 -> class 895
# page = requests.get('https://www.finalfantasyd20.com/bestiary')
page = requests.get('https://www.finalfantasyd20.com/bestiary/undead/abadon/')
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id='main')
Name = soup.find(id='abadon')
# print(Name.text)
# Type = soup.find() is not going to work here because there is no header
Stats = results.find_all('p')
for stat in Stats:
    print(stat.text)
    Str = stat.find(string='Str')
    print(Str)
I have made a number of attempts to isolate the specific value without entering it myself, but they have all failed.
2 Answers
When I try it, print(Str) outputs nothing. Maybe you need this:
As I understand it, you want to scrape the stats below the STATISTICS header (h5).
Below STATISTICS there is a paragraph (p), and its children are your target.
We can see this as a tree where p is the parent node and the strong elements are the child nodes.
One solution is to:
1/ find the child with a strong tag whose string is the stat name, where the stat could be Str, Dex, ... [in your case stat.find("strong", string='Str')]
2/ navigate to its next sibling to extract the corresponding value [Str.next_sibling]
Check out the official BeautifulSoup documentation to find out more: https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=next_sibling#next-sibling-and-previous-sibling
Here is a patched version of your code:
You can do the same for the other stats.
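As a minimal sketch of the two steps above, the following runs against a small inline HTML fragment instead of the live page. The fragment, its markup, and the values in it are assumptions made for illustration; the real page may be structured differently, so the selectors may need adjusting.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the STATISTICS paragraph;
# the markup and numbers are assumptions, not copied from the site.
html = """
<div id="main">
  <h5>STATISTICS</h5>
  <p><strong>Str</strong> 22, <strong>Dex</strong> 15, <strong>Con</strong> 18</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = soup.find(id="main")

str_value = None
for stat in results.find_all("p"):
    # Step 1: look for a <strong> tag whose text is exactly "Str"
    label = stat.find("strong", string="Str")
    if label is None:
        continue  # this paragraph does not hold the stat
    # Step 2: the value is the text node right after </strong>
    str_value = label.next_sibling.strip(" ,")
    break

print(str_value)
```

The `None` check matters because `find` returns `None` for every paragraph that does not contain the label, and calling `.next_sibling` on `None` would raise an `AttributeError`.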
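One way to repeat this for every stat without naming each one is to loop over all the strong tags in the paragraph and collect label/value pairs into a dictionary, which can then be dumped to JSON. This is only a sketch: the inline HTML, the stat names, and the flat JSON layout are assumptions, and the keys would need to be mapped onto your own template.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stats paragraph; markup and values are assumptions.
html = (
    "<p><strong>Str</strong> 22, <strong>Dex</strong> 15, "
    "<strong>Con</strong> 18, <strong>Int</strong> 6</p>"
)

soup = BeautifulSoup(html, "html.parser")

stats = {}
for label in soup.find_all("strong"):
    value = label.next_sibling
    # Skip labels that are not followed directly by a text node
    # (NavigableString is a str subclass, Tag is not).
    if not isinstance(value, str):
        continue
    stats[label.get_text(strip=True)] = value.strip(" ,;")

# Values are kept as strings, since a stat might not always be numeric.
stats_json = json.dumps(stats, indent=2)
print(stats_json)
```

From here the dictionary entries can be copied into the fields of the existing JSON template before writing the file.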