
I am parsing some text in Python, using BeautifulSoup4.

The address block starts with a cell like this:

<td><strong>Address</strong></td>

I find the above cell using soup.find("td", string="Address") (a bare second positional argument would be treated as a CSS class, not as text).

But now some address cells also contain a highlight character, like this:

<td><strong><span>*</span>Address</strong></td>

This has broken my matching. Is there still a way to find this TR?
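
For reference, here is a minimal reproduction of the two cell shapes. It assumes the built-in html.parser and an exact string match with find_all(); the sample rows (`sample_html` and the street values) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample rows: one plain label cell, one with the highlight <span>.
sample_html = """
<table>
  <tr><td><strong>Address</strong></td><td>1 Example Street</td></tr>
  <tr><td><strong><span>*</span>Address</strong></td><td>2 Example Street</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# First <td> of each row is the label cell.
label_cells = [row.find("td") for row in soup.find_all("tr")]
plain_cell, highlighted_cell = label_cells

# .string is only set when a tag wraps exactly one string (possibly through
# a chain of single children); the extra <span> makes it None.
print(repr(plain_cell.string))        # 'Address'
print(repr(highlighted_cell.string))  # None

# An exact string match on the <td> therefore only sees the plain cell.
matches = soup.find_all("td", string="Address")
print(len(matches))                   # 1
```

The second cell is skipped because its <strong> now has two children (the <span> and the text), so there is no single string to compare against.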

2 Answers


  1. Chosen as BEST ANSWER

    I ended up with a solution like this:

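        # Collect every <strong>, then keep those with "Address" as a direct text child.
        strong_blocks = soup.find_all("strong")
        def common_block(tag):
            # recursive=False limits the search to direct children, so the <span> text is skipped
            return tag.find(string="Address", recursive=False)
        address_texts = list(filter(common_block, strong_blocks))
        if len(address_texts) == 1:
            # filter() keeps the matching <strong> tags, so .parent is the enclosing <td>
            address_text = address_texts[0]
            address_cell = address_text.parent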
    
    

    The trick was that once I had a list of <strong> elements, I could pass recursive=False so that only each element's direct children are searched, which keeps the contents of the <span> from being inspected.


  2. You can try either a CSS selector or a regular expression (re), as follows:

    # recent Soup Sieve releases deprecate :contains in favor of :-soup-contains
    soup.select('td:has(strong:-soup-contains("Address"))')
    

    OR

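    import re
    # string=/text= compares against td.string, which is None once the <span> is added,
    # so match on the cell's combined text instead
    soup.find(lambda tag: tag.name == "td" and re.search("Address", tag.get_text()))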
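
A minimal sketch for comparing three lookups against both cell shapes: a direct-child string match on <strong>, a CSS selector, and a string= regular expression on the <td>. It assumes the built-in html.parser and a Soup Sieve version recent enough to support :-soup-contains; `sample_html` and the helper function names are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: one plain label cell and one with the highlight <span>.
sample_html = (
    "<table>"
    "<tr><td><strong>Address</strong></td><td>1 Example Street</td></tr>"
    "<tr><td><strong><span>*</span>Address</strong></td><td>2 Example Street</td></tr>"
    "</table>"
)
soup = BeautifulSoup(sample_html, "html.parser")

def by_direct_string(soup):
    """<td> parents of every <strong> with "Address" as a direct text child."""
    return [
        strong.parent
        for strong in soup.find_all("strong")
        if strong.find(string="Address", recursive=False)
    ]

def by_css(soup):
    """<td> cells containing a <strong> whose text includes "Address"."""
    return soup.select('td:has(strong:-soup-contains("Address"))')

def by_string_regex(soup):
    """<td> cells whose single .string matches the pattern."""
    return soup.find_all("td", string=re.compile("Address"))

# Print how many label cells each lookup finds.
for finder in (by_direct_string, by_css, by_string_regex):
    print(finder.__name__, len(finder(soup)))
```

On this sample the first two lookups report both label cells, while the string= form reports only the plain one, because .string is None for a <td> whose <strong> holds both a <span> and text.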
    