
So I’m trying to scrape some data from a website, and I want to extract the text inside the <li> tags shown below. The problem is that they contain these ::marker entries, which I understand are pseudo-elements, so they can’t be parsed using BeautifulSoup?

<ul>
    <li>
        ::marker
        (text)
    </li>
    <li>
        ::marker
        (text)
    </li>
</ul>

This is what I tried, but it didn’t return the other <li> tags that don’t contain the ::marker:

from bs4 import BeautifulSoup
import requests 


url = "..."  # the link of the website
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

reference = soup.find("li")
print(reference.text) 

#output is None

2 Answers


  1. As there are multiple items, it is probably a good idea to use find_all and then iterate through the results, calling get_text on each one. Something like:

    list_items = soup.find_all("li")
    for element in list_items:
        print(element.get_text())
    

    You could add some extra code to check that find_all does actually return at least one element.

  2. You can use a CSS selector to extract the text content of the li elements without the ::marker pseudo-elements, like this:

    li_elements = soup.select('li')
    for li in li_elements:
        text = li.get_text(strip=True)
        print(text)
    

    Note that this extracts the text content of every li element. The ::marker pseudo-element is generated by the browser when it renders a list item (it is the bullet or number you see in the developer tools); it is not part of the HTML source that BeautifulSoup parses, so there is nothing to filter out. A selector like the following is not valid, because CSS does not allow pseudo-elements inside :not():

    li_elements = soup.select('li:not(::marker)')
    

    Stick with the plain 'li' selector instead.
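Pulling the two answers together, here is a minimal sketch that collects the text of every li element and checks that at least one was found. It assumes the list items are present in the HTML that requests downloads. SAMPLE_HTML and extract_list_items are hypothetical names used only for illustration, since the real page's markup is not shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
# "::marker" does not appear here: browsers add it only when rendering.
SAMPLE_HTML = """
<ul>
    <li>first item</li>
    <li>second item</li>
</ul>
"""


def extract_list_items(html):
    """Return the stripped text of every <li> element in html."""
    soup = BeautifulSoup(html, "html.parser")
    return [li.get_text(strip=True) for li in soup.find_all("li")]


items = extract_list_items(SAMPLE_HTML)
if items:
    for text in items:
        print(text)
else:
    # An empty result may mean the list is not in the downloaded HTML,
    # for example because the page builds it with JavaScript.
    print("no <li> elements found")
```

If the list comes back empty for the real page, printing page.text and searching it for the expected items is a quick way to see whether they are in the downloaded HTML at all.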
