skip to Main Content

I am trying to parse through and extract data from HTML using Python and Beautifulsoup.

The sample HTML looks like this (it has multiple such structures being repeated:

Sample Browser Render

Here is a sample code snippet that generated the above screenshot:

<div class="indented"><p id="1" class="">
    <strong><strong><strong><strong><strong><strong><strong><strong><strong><strong>Regional</strong></strong></strong></strong></strong></strong></strong></strong></strong></strong>
</p>
    <ul id="2" class="bulleted-list">
        <li style="list-style-type:disc">Lorem ipsum about region 1
        </li>
    </ul>
    <ul id="3" class="bulleted-list">
        <li style="list-style-type:disc">Lorem ipsum about region 2</li>
    </ul>
    <p id="4" class="">
        <strong><strong><strong><strong><strong><strong><strong><strong><strong>Country</strong></strong></strong></strong></strong></strong></strong></strong></strong>
    </p>
    <ul id="5" class="bulleted-list">
        <li style="list-style-type:disc">Lorem ipsum about country 1
            <ul id="ac" class="bulleted-list">
                <li style="list-style-type:circle">Sub lorem ipsum about country 1
                </li>
            </ul>
            <ul id="ad" class="bulleted-list">
                <li style="list-style-type:circle">Sub lorem ipsum about country 1-2</li>
            </ul>
            <ul id="ae" class="bulleted-list">
                <li style="list-style-type:circle">Sub lorem ipsum about country 1-3</li>
            </ul>
        </li>
    </ul>
    <ul id="6" class="bulleted-list">
        <li style="list-style-type:disc">Lorem ipsum about country 2
            <ul id="ab" class="bulleted-list">
                <li style="list-style-type:circle">Sub lorem ipsum about country 2
                </li>
            </ul>
        </li>
    </ul>
    <p id="7" class=""><strong><strong><strong><strong><strong>City</strong></strong></strong></strong></strong></p>
    <ul id="8" class="bulleted-list">
        <li style="list-style-type:disc">Lorem ipsum about city 1</li>
        <ul id="acc" class="bulleted-list">
            <li style="list-style-type:circle">Sub lorem ipsum about country 1
                <ul id="add" class="bulleted-list">
                    <li style="list-style-type:circle">Sub lorem ipsum about country 1-2</li>
                </ul>
            </li>
        </ul>
        <ul id="aee" class="bulleted-list">
            <li style="list-style-type:circle">Sub lorem ipsum about country 1-3</li>
        </ul>
    </ul>
    <ul id="9" class="bulleted-list">
        <li style="list-style-type:disc">Lorem ipsum about country 2
            <figure id="a" class="image"><a
                    href="a.png"><img
                    style="width:3040px"
                    src="a.png"></a>
            </figure>
        </li>
    </ul>
</div>

And here is a sample of how the above appears in Chrome’s console:
Sample HTML Structure

My objective is to get all the nested text under the City <p> tag. You can assume that the City <p> tag will always be the last <p> tag before the <div> it sits in is closed out. Under the City <p> tag, bullet points can have multiple layers of nesting and may have text in different formats (eg: bolded, italicized and so on). Any non-text data like images should be ignored.

Given the HTML structure seen in the Chrome console above, unfortunately the <ul> elements are not direct children of the <p> tag, which is making it difficult for me to accomplish this task.

Anyone have advice on how I can do this programmatically? Thanks

2

Answers


  1. IIUC, you can use tag.find_previous to check if the tag you’re in is under City or not. E.g.:

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html_text, "html.parser")  # html_text contains your snippet from the question
    
    text = []
    for ul in soup.select("ul:not(:has(ul))"):
        p = ul.find_previous("p")
        if p and p.text.strip() == "City":
            text.append(ul.get_text(strip=True, separator=" "))
    
    print("n".join(text))
    

    Prints:

    Lorem ipsum about city 1
    Lorem ipsum about country 2
    

    EDIT:

    from bs4 import BeautifulSoup
    
    html_text = """
    <div class="indented"><p id="1" class="">
        <strong><strong><strong><strong><strong><strong><strong><strong><strong><strong>Regional</strong></strong></strong></strong></strong></strong></strong></strong></strong></strong>
    </p>
        <ul id="2" class="bulleted-list">
            <li style="list-style-type:disc">Lorem ipsum about region 1
            </li>
        </ul>
        <ul id="3" class="bulleted-list">
            <li style="list-style-type:disc">Lorem ipsum about region 2</li>
        </ul>
        <p id="4" class="">
            <strong><strong><strong><strong><strong><strong><strong><strong><strong>Country</strong></strong></strong></strong></strong></strong></strong></strong></strong>
        </p>
        <ul id="5" class="bulleted-list">
            <li style="list-style-type:disc">Lorem ipsum about country 1
                <ul id="ac" class="bulleted-list">
                    <li style="list-style-type:circle">Sub lorem ipsum about country 1
                    </li>
                </ul>
                <ul id="ad" class="bulleted-list">
                    <li style="list-style-type:circle">Sub lorem ipsum about country 1-2</li>
                </ul>
                <ul id="ae" class="bulleted-list">
                    <li style="list-style-type:circle">Sub lorem ipsum about country 1-3</li>
                </ul>
            </li>
        </ul>
        <ul id="6" class="bulleted-list">
            <li style="list-style-type:disc">Lorem ipsum about country 2
                <ul id="ab" class="bulleted-list">
                    <li style="list-style-type:circle">Sub lorem ipsum about country 2
                    </li>
                </ul>
            </li>
        </ul>
        <p id="7" class=""><strong><strong><strong><strong><strong>City</strong></strong></strong></strong></strong></p>
        <ul id="8" class="bulleted-list">
            <li style="list-style-type:disc">Lorem ipsum about city 1</li>
            <ul id="acc" class="bulleted-list">
                <li style="list-style-type:circle">Sub lorem ipsum about country 1
                    <ul id="add" class="bulleted-list">
                        <li style="list-style-type:circle">Sub lorem ipsum about country 1-2</li>
                    </ul>
                </li>
            </ul>
            <ul id="aee" class="bulleted-list">
                <li style="list-style-type:circle">Sub lorem ipsum about country 1-3</li>
            </ul>
        </ul>
        <ul id="9" class="bulleted-list">
            <li style="list-style-type:disc">Lorem ipsum about country 2
                <figure id="a" class="image"><a
                        href="a.png"><img
                        style="width:3040px"
                        src="a.png"></a>
                </figure>
            </li>
        </ul>
    </div>"""
    
    soup = BeautifulSoup(html_text, "html.parser")
    
    
    def fn(ul, level=0):
        for tag in ul.find_all(["li", "ul"], recursive=False):
            if tag.name == "li":
                print(
                    "t" * level,
                    " ".join(c.strip() for c in tag.find_all(string=True, recursive=False)),
                    sep="",
                )
                for li_ul in tag.find_all("ul", recursive=False):
                    fn(li_ul, level + 1)
            else:
                fn(tag, level + 1)
    
    
    for ul in soup.select("div > ul"):
        fn(ul)
    

    Prints:

    Lorem ipsum about region 1
    Lorem ipsum about region 2
    Lorem ipsum about country 1   
            Sub lorem ipsum about country 1
            Sub lorem ipsum about country 1-2
            Sub lorem ipsum about country 1-3
    Lorem ipsum about country 2 
            Sub lorem ipsum about country 2
    Lorem ipsum about city 1
            Sub lorem ipsum about country 1 
                    Sub lorem ipsum about country 1-2
            Sub lorem ipsum about country 1-3
    Lorem ipsum about country 2 
    
    Login or Signup to reply.
  2. The HTML is invalid which makes working with standard tools more difficult than it needs to be. See the HTML validator.

    Your <p> elements should include the unordered lists and not end immediately after your opening <p>. Nested unordered lists must be done properly (inside a list item, not as a direct child of another unordered list.

    In terms of selecting the last <p> child of some <div> elements, or any elements whatsoever, this is what CSS selectors do. A web developer would simply use JavaScript’s .querySelector then get the .textContent.

    Any other tool for this job is the wrong tool.

    let divs = document.querySelectorAll('div');
    
    divs.forEach(function(div){
      console.log(div.querySelector('p:last-of-type').nextSibling.nextSibling.textContent)
    
    })
    <div class="indented"><p id="1" class="">
        <strong><strong><strong><strong><strong><strong><strong><strong><strong><strong>Regional</strong></strong></strong></strong></strong></strong></strong></strong></strong></strong>
    </p>
        <ul id="2" class="bulleted-list">
            <li style="list-style-type:disc">Lorem ipsum about region 1
            </li>
        </ul>
        <ul id="3" class="bulleted-list">
            <li style="list-style-type:disc">Lorem ipsum about region 2</li>
        </ul>
        <p id="4" class="">
            <strong><strong><strong><strong><strong><strong><strong><strong><strong>Country</strong></strong></strong></strong></strong></strong></strong></strong></strong>
        </p>
        <ul id="5" class="bulleted-list">
            <li style="list-style-type:disc">Lorem ipsum about country 1
                <ul id="ac" class="bulleted-list">
                    <li style="list-style-type:circle">Sub lorem ipsum about country 1
                    </li>
                </ul>
                <ul id="ad" class="bulleted-list">
                    <li style="list-style-type:circle">Sub lorem ipsum about country 1-2</li>
                </ul>
                <ul id="ae" class="bulleted-list">
                    <li style="list-style-type:circle">Sub lorem ipsum about country 1-3</li>
                </ul>
            </li>
        </ul>
        <ul id="6" class="bulleted-list">
            <li style="list-style-type:disc">Lorem ipsum about country 2
                <ul id="ab" class="bulleted-list">
                    <li style="list-style-type:circle">Sub lorem ipsum about country 2
                    </li>
                </ul>
            </li>
        </ul>
        <p id="7" class=""><strong><strong><strong><strong><strong>City</strong></strong></strong></strong></strong></p>
        <ul id="8" class="bulleted-list">
            <li style="list-style-type:disc">Lorem ipsum about city 1</li>
            <ul id="acc" class="bulleted-list">
                <li style="list-style-type:circle">Sub lorem ipsum about country 1
                    <ul id="add" class="bulleted-list">
                        <li style="list-style-type:circle">Sub lorem ipsum about country 1-2</li>
                    </ul>
                </li>
            </ul>
            <ul id="aee" class="bulleted-list">
                <li style="list-style-type:circle">Sub lorem ipsum about country 1-3</li>
            </ul>
        </ul>
        <ul id="9" class="bulleted-list">
            <li style="list-style-type:disc">Lorem ipsum about country 2
                <figure id="a" class="image"><a
                        href="a.png"><img
                        style="width:3040px"
                        src="a.png"></a>
                </figure>
            </li>
        </ul>
    </div>
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search