Isolating Data while Web Scraping using python - Html

Zigboy
March 31, 2023
220 views
0 votes
2 Answers

I am trying to set up automatic web scraping of a website with Python, storing the HTML data and building a JSON file that follows a specific format. I already have the JSON file template and have been able to get the HTML data as text using BeautifulSoup. However, I cannot figure out how to select specific parts of the data without hard-coding the values myself. Is there something I can do, or would I have to enter all of that data by hand?
Thanks. Below is the code I am using.

import requests
from bs4 import BeautifulSoup
# need to automate page swapping, but for now test a single page
# need to increment over tr class-2 -> class 895
# page = requests.get('https://www.finalfantasyd20.com/bestiary')
page = requests.get('https://www.finalfantasyd20.com/bestiary/undead/abadon/') 
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id= 'main')
Name = soup.find(id='abadon')
# print(Name.text)
# Type = soup.find()  # not going to work as-is because there is no header
Stats = results.find_all('p') 
for stat in Stats:
    print(stat.text)
    Str = stat.find(string='Str')
    print(Str)

I have made a number of attempts to isolate a specific value without entering it myself, but they have all failed.

Answers

- PZ
- March 31, 2023 at 5:47 am
- 0 votes
When I tried it, print(Str) printed nothing. Maybe you need this:
```
str_list = []

for stat in Stats:
    print(stat.text)
    Str = stat.find(string='Str')
    # keep the text of every paragraph so it can be processed later
    str_list.append(stat.text)
    # print(Str)
```
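
Once the paragraph text has been collected this way, the values still have to be pulled out of it. One possible way, sketched here with a regular expression rather than BeautifulSoup, is shown below. The sample text and the helper name parse_stats are assumptions for illustration; the real page text may be laid out differently.

```python
import re

# Hypothetical sample of one collected paragraph's text; the real text may differ.
sample_text = "Str 26, Dex 18"


def parse_stats(text):
    """Return a dict mapping each three-letter stat label to its integer value."""
    # A label is assumed to be a capital letter plus two lowercase letters,
    # followed by a single space and a run of digits.
    pairs = re.findall(r"\b([A-Z][a-z]{2}) (\d+)", text)
    return {label: int(value) for label, value in pairs}


stats = parse_stats(sample_text)
print(stats)
```

Stats whose value is not a plain number would be skipped by this pattern, so it would need adjusting for those.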

- Ktifler
- March 31, 2023 at 6:53 am
- 0 votes
As I understand it, you want to scrape the stats below the STATISTICS header (h5).
As you can see, below STATISTICS there is a paragraph (p), and its children are your target:
```
Str 26, Dex 18...... 
```
We can view this as a tree where p is the parent node and
```
 Str 
 ' 26, '
 Dex
 ' 18, '
 .
 .
 .
```
are its child nodes.

One solution is to:

1/ find the child that is a strong tag with the stat name as its string, where the stat could be Str, Dex, ... [in your case stat.find("strong", string='Str')]

2/ navigate to its next sibling to extract the corresponding value [Str.next_sibling]

Check the official Beautiful Soup documentation for more: https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=next_sibling#next-sibling-and-previous-sibling
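
The two steps above can be tried offline on a small inline fragment. This is only a sketch: the HTML string is an assumption inferred from the tree described earlier, and stat_value is a hypothetical helper name, so the real page may need adjustments.

```python
from bs4 import BeautifulSoup

# Assumed markup, inferred from the tree described above; the real page may differ.
html = "<p><strong>Str</strong> 26, <strong>Dex</strong> 18, </p>"
soup = BeautifulSoup(html, "html.parser")


def stat_value(paragraph, label):
    """Return the text that follows <strong>label</strong>, or None if absent."""
    # Step 1: find the strong tag whose string is exactly the label
    tag = paragraph.find("strong", string=label)
    if tag is None or tag.next_sibling is None:
        return None
    # Step 2: the next sibling is the text node after the tag, e.g. " 26, "
    return str(tag.next_sibling).strip(" ,")


paragraph = soup.find("p")
print(stat_value(paragraph, "Str"))
print(stat_value(paragraph, "Dex"))
```

Stripping spaces and commas leaves only the value itself, which is easier to place into a JSON file later.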

Here is a patched version of your code:
```
import requests
from bs4 import BeautifulSoup

# need to automate page swapping, but for now test a single page
# need to increment over tr class-2 -> class 895
# page = requests.get('https://www.finalfantasyd20.com/bestiary')
page = requests.get('https://www.finalfantasyd20.com/bestiary/undead/abadon/')

soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id='main')
Name = soup.find(id='abadon')
# print(Name.text)
# Type = soup.find()  # not going to work as-is because there is no header
stats = results.find_all('p')
for stat in stats:
    # look for <strong>Str</strong> inside this paragraph
    Str = stat.find("strong", string='Str')
    if Str is not None:
        Str_text = Str.text
        # the value of Str is the text node right after the tag
        value = Str.next_sibling
        print(value)
```
You can do the same for the other stats.
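
As a rough sketch of doing the same for every stat at once, the loop below walks all strong tags in the paragraphs and keeps the text that follows each one, then prints the result as JSON. The inline HTML and the helper name collect_stats are assumptions for illustration, and the output shape is not the asker's JSON template, which is not shown in the thread.

```python
import json

from bs4 import BeautifulSoup

# Assumed markup for the statistics paragraph; only Str and Dex appear in the thread.
html = "<div id='main'><p><strong>Str</strong> 26, <strong>Dex</strong> 18, </p></div>"
soup = BeautifulSoup(html, "html.parser")


def collect_stats(container):
    """Map every <strong> label in the container's paragraphs to the text after it."""
    found = {}
    for paragraph in container.find_all("p"):
        for tag in paragraph.find_all("strong"):
            sibling = tag.next_sibling
            # Skip labels that are followed by another tag or by nothing
            if isinstance(sibling, str):
                found[tag.get_text(strip=True)] = sibling.strip(" ,")
    return found


stats = collect_stats(soup.find(id="main"))
print(json.dumps(stats, indent=2))
```

On the real page this would also pick up any other bold labels inside the paragraphs, so filtering the dictionary down to the wanted keys may be needed before filling in the template.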