
I am trying to set up automatic web scraping of a website using Python, storing the HTML data and producing a JSON file in a specific format. I already have the JSON file template and have been able to get the HTML data as text using BeautifulSoup. However, I cannot figure out how to select specific parts of the data without hard-coding them. Is there something I can do, or would I have to enter all of that data myself?
Thanks. The code I am using is below.

import requests
from bs4 import BeautifulSoup
# need to automate page swaping but for now test
# need to inciment over tr class-2 ->class 895 page = requests.get('https://www.finalfantasyd20.com/bestiary)
page = requests.get('https://www.finalfantasyd20.com/bestiary/undead/abadon/') 
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id= 'main')
Name = soup.find(id='abadon')
# print(Name.text)
# Type = soup.find() not gonna work with how this is caus no header
Stats = results.find_all('p') 
for stat in Stats:
    print(stat.text)
    Str = stat.find(string='Str')
    print(Str)

I have made a number of attempts to isolate the specific value without entering it myself, but they have all failed.

2 Answers


  1. When I ran it, print(Str) output nothing. Maybe you need this:

    str_list =[]
    
    for stat in Stats:
        print(stat.text)
        Str = stat.find(string='Str')
        str_list.append(stat.text)
        #print(Str)
    
  2. As I understand it, you want to scrape the stats below the STATISTICS header (h5).
    Below STATISTICS there is a paragraph, and its children are your target:

    <p><strong>Str</strong> 26, <strong>Dex</strong> 18......</p> 
    

    We can view this as a tree where p is the parent node and

    
           <strong>Str</strong> 
           ' 26, '
           <strong>Dex</strong>
           ' 18, '
            .
            .
            .
    
    

    are its child nodes.

    One solution is to:

    1/ Find the child with a strong tag whose string is the stat name, where the stat could be Str, Dex, and so on [in your case stat.find("strong", string='Str')].

    2/ Navigate to its next sibling to extract the corresponding value [Str.next_sibling].

    Check the official Beautiful Soup documentation for more: https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=next_sibling#next-sibling-and-previous-sibling

    Here is a patched version of your code:

    import requests
    from bs4 import BeautifulSoup
    import re
    # need to automate page swaping but for now test
    # need to inciment over tr class-2 ->class 895 page = requests.get('https://www.finalfantasyd20.com/bestiary)
    page = requests.get('https://www.finalfantasyd20.com/bestiary/undead/abadon/')
    
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find(id= 'main')
    Name = soup.find(id='abadon')
    # print(Name.text)
    # Type = soup.find() not gonna work with how this is caus no header
    stats = results.find_all('p')
    for stat in stats:
        # print(stat.text)
        # print(stat)
        # Look for the <strong>Str</strong> label inside this paragraph
        Str = stat.find("strong",string='Str')
        if Str is not None:
            # The text node right after the label holds the value, e.g. ' 26, '
            value = Str.next_sibling
            print(value)
    

    You can do the same for the other stats.
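
To repeat the same lookup for every stat and collect the values for a JSON file, something along these lines might work. This is only a sketch: `SAMPLE_HTML` and `extract_stats` are hypothetical names introduced here, the sample markup reuses just the Str and Dex values quoted above so it runs without a network request, and the flat `{"Str": 26, ...}` shape is a placeholder because the actual JSON template was not posted.

```python
import json
import re

from bs4 import BeautifulSoup

# Stand-in for the real page: only the part of the STATISTICS paragraph
# that is quoted in this thread (assumption: the live page has this shape).
SAMPLE_HTML = "<p><strong>Str</strong> 26, <strong>Dex</strong> 18</p>"


def extract_stats(paragraph, names):
    """Return {label: value} for each <strong>label</strong> in `names`.

    The value is read from the text node that directly follows the label,
    which is what `next_sibling` gives for markup like the sample above.
    """
    stats = {}
    for strong in paragraph.find_all("strong"):
        label = strong.get_text(strip=True)
        if label not in names:
            continue
        sibling = strong.next_sibling
        if sibling is None:
            continue
        # Pull the first integer out of text such as ' 26, '
        match = re.search(r"-?\d+", str(sibling))
        if match:
            stats[label] = int(match.group())
    return stats


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
stats = extract_stats(soup.p, {"Str", "Dex"})
print(json.dumps(stats))
```

With the real page you would pass each paragraph from `results.find_all('p')` instead of `soup.p`, extend the set of names to the other stats, and copy the resulting values into your own template before writing it out with `json.dump`.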
