
I want to find all the li elements nested within <ol class="messageList" id="messageList">. I have tried the following solutions and they all return 0 messages:

messages = soup.find_all("ol")
messages = soup.find_all('div', class_='messageContent')
messages = soup.find_all("li")
messages = soup.select('ol > li')
messages = soup.select('.messageList > li')

The full html can be seen here in this gist.

  1. What is the correct way of grabbing these list items?
  2. In Beautiful Soup, do you have to know the nested path to reach the element you are after? Or is something like soup.find_all("li") supposed to return all matching elements, whether they are nested or not?

Happy for non-bs4 answers too.

Update

This is how I got the code.

from bs4 import BeautifulSoup

# Load the HTML content
with open('/tmp/property.html', 'r', encoding='utf-8') as file:
    html_content = file.read()

# Create a BeautifulSoup object and specify the parser
soup = BeautifulSoup(html_content, 'html.parser')

The file is in the gist link above.

Update 2

I got it working using the requests library. It looks like manually downloading the file might have caused some of the HTML to break?

import requests
from bs4 import BeautifulSoup

url = "https://www.propertychat.com.au/community/threads/melbourne-property-market-2024.75213/"
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")
messages = soup.select('.messageList > li')

2 Answers


  1. Thank you for offering example code + data.

    This will pick out the various list item elements you wanted:

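    from pathlib import Path

    from bs4 import BeautifulSoup


    def parse(in_file: Path) -> None:
        hrule = "=" * 60
        soup = BeautifulSoup(in_file.read_text(), "html.parser")

        # Locate the <ol class="messageList" id="messageList"> container.
        ol = soup.find("ol", {"class": "messageList", "id": "messageList"})

        # Print every list item under it, separated by a horizontal rule.
        for li in ol.find_all("li"):
            print(f"\n\n{hrule}\n\n{li}")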
    

    You could certainly ask for soup.find_all("li").
    That would retrieve all list items in the document,
    even if they are under some other <ul> that you append to the document.

    I started out by looping over all the <ol>s,
    until I noticed the document only has one of them.

    Typically I will write nested loops corresponding
    to the nesting of the document’s elements,
    but you certainly don’t have to.
    It’s just easier to make sense of the results that way,
    since you have context about which container the element came from.

  2. Maybe this is what you’re looking for?

    import requests as r
    from bs4 import BeautifulSoup as bs
    
    URL = "https://www.propertychat.com.au/community/threads/melbourne-property-market-2024.75213/"
    page = r.get(URL)
    
    soup_obj = bs(page.content, "html.parser")
    
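    # find() returns the first <ol> in the document.
    results_object = soup_obj.find("ol")
    
    # find_all() already returns a list-like ResultSet, so no extra brackets are needed.
    li_list = results_object.find_all("li")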
    
    print(li_list)
    

    This code uses requests and bs4 to find the first ol element on the page, collects the li elements it contains into li_list, and then prints that list.
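
On the second question (whether you need to know the nested path): the sketch below may help show the difference between the calls discussed above. It runs against a small made-up snippet, not the real thread page, so SAMPLE_HTML and its contents are assumptions for illustration only. By default find_all searches all descendants; recursive=False and the CSS child combinator (>) restrict the match to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the thread markup; the real page is much larger.
SAMPLE_HTML = """
<div class="content">
  <ol class="messageList" id="messageList">
    <li class="message">first
      <ul><li>nested item</li></ul>
    </li>
    <li class="message">second</li>
  </ol>
  <ul><li>unrelated item</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every <li> in the document, at any depth.
all_items = soup.find_all("li")

# Narrow the search to the message list first.
message_list = soup.find("ol", id="messageList")

# All <li> descendants of the list, including the nested one.
descendant_items = message_list.find_all("li")

# Only <li> elements that are direct children of the list.
direct_items = message_list.find_all("li", recursive=False)

# The same direct-child match expressed as a CSS selector.
selected_items = soup.select(".messageList > li")

print(len(all_items), len(descendant_items), len(direct_items), len(selected_items))
# 4 3 2 2
```

If every one of these returns an empty list on the real file, the markup that reached the parser probably does not contain the elements at all, which would be consistent with the saved file being the problem rather than the queries.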
