Html - Python - Parsing / extracting sections using Python

LandonStatis
November 27, 2023

I was hoping someone could give me some suggestions.

So, I have a large HTML document. I need to extract data between 2 tags. It’s a dynamic document, so it can be different each time. But there are a couple of constants. The starting point of extraction will be the section that starts with "Notes to Unaudited Condensed". You can see the section ID from the table of contents:

 <a href="#a1NatureofOperations_790426"><span style="font-style:normal;font-weight:normal;">Notes to Unaudited Condensed Consolidated Financial Statements</span></a></p></td>

Basically I want to extract all content up until the next section ID, which always starts with "Item 2.":

 <a href="#ITEM2MANAGEMENTSDISCUSSIONANDANALYSIS_77"><span style="font-style:normal;font-weight:normal;">Item 2.</span></a></p></td>

So, is there a way for me to get the tag ID from the anchor, and then I can search the document for that tag ID as the start / end of the parsing that is needed?

Or, perhaps there is some other Python HTML parser which can do much of the work for me?

Thanks!

Answers

I hope I’ve understood your question. It seems you want to extract the content of an HTML file between two tags. I recommend the BeautifulSoup library, which is a popular HTML parser. You can use this code as an example:

from bs4 import BeautifulSoup
from bs4.element import NavigableString

def extract_content(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Find the table-of-contents links for the start and end sections
    start_link = soup.find('a', {'href': '#a1NatureofOperations_790426'})
    end_link = soup.find('a', {'href': '#ITEM2MANAGEMENTSDISCUSSIONANDANALYSIS_77'})
    if not (start_link and end_link):
        return None

    # Strip the leading '#' from each href to get the section IDs
    start_id = start_link['href'][1:]
    end_id = end_link['href'][1:]

    # Look up the <div> elements that carry those IDs
    start_el = soup.find('div', {'id': start_id})
    end_el = soup.find('div', {'id': end_id})
    if not (start_el and end_el):
        return None

    # Walk the document in order from the start element and collect
    # every text node until the end element is reached
    parts = []
    for node in start_el.next_elements:
        if node is end_el:
            break
        if isinstance(node, NavigableString):
            parts.append(node)
    return ''.join(parts)

# Example usage
with open('your_html_file.html', 'r') as file:
    html_content = file.read()

result = extract_content(html_content)

if result:
    print(result)
else:
    print("Content extraction failed.")

This script assumes that your content is within <div> tags with the specified IDs. Adjust the tags and attributes accordingly based on your HTML structure.
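Since the document is dynamic, the href values will probably differ from file to file, so hard-coding them may not hold up. A minimal sketch of looking the links up by their visible text instead is below. The helper name `find_section_id` and the shortened sample markup are my own assumptions for illustration, not something from your document.

```python
from bs4 import BeautifulSoup

def find_section_id(soup, label_prefix):
    """Return the fragment ID of the first in-page link whose text starts with label_prefix."""
    for link in soup.find_all('a', href=True):
        href = link['href']
        # Only consider in-page links such as href="#some_id"
        if href.startswith('#') and link.get_text(strip=True).startswith(label_prefix):
            return href[1:]  # drop the leading '#'
    return None

# Shortened sample of the table of contents (assumed markup)
toc = """
<p><a href="#a1NatureofOperations_790426"><span>Notes to Unaudited Condensed Consolidated Financial Statements</span></a></p>
<p><a href="#ITEM2MANAGEMENTSDISCUSSIONANDANALYSIS_77"><span>Item 2.</span></a></p>
"""

soup = BeautifulSoup(toc, 'html.parser')
start_id = find_section_id(soup, 'Notes to Unaudited Condensed')
end_id = find_section_id(soup, 'Item 2.')
print(start_id, end_id)
```

The two IDs returned here could then be used in place of the hard-coded href values when searching for the start and end elements.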

Good luck! 😀

Here is one possible way to extract the tags between two tags:

from bs4 import BeautifulSoup

html_text = """
<div>
    Something other ...
</div>
<div>
    <a href="#"><span>Notes to Unaudited Condensed Consolidated Financial Statements</span></a>
</div>
<div>I want this...</div>
<div>I want this too...</div>
<div>
    <a href="#"><span>Item 2.</span></a>
</div>
<div>I DON'T want this...</div>"""

soup = BeautifulSoup(html_text, "html.parser")

tag_start = soup.find(
    lambda tag: "Notes to Unaudited Condensed Consolidated Financial Statements"
    in tag.text,
    recursive=False,
)

tag_end = soup.find(
    lambda tag: "Item 2." in tag.text,
    recursive=False,
)

tags_in_between, state = [], False
for tag in soup.find_all(recursive=False):
    if tag is tag_start:
        state = True
    elif tag is tag_end:
        state = False
    elif state:
        tags_in_between.append(tag)

print(tags_in_between)

Prints:

[<div>I want this...</div>, <div>I want this too...</div>]
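If you already have the start and end elements (for example, looked up by ID), a similar result can be sketched with `find_next_siblings()`. This assumes both elements are siblings under the same parent, which may not be true for your document. The helper name `siblings_between` and the IDs `notes` and `item2` are made up for the example.

```python
from bs4 import BeautifulSoup

def siblings_between(start, end):
    """Collect the sibling tags that follow start, stopping before end."""
    collected = []
    for sibling in start.find_next_siblings():
        if sibling is end:
            break
        collected.append(sibling)
    return collected

# Sample markup with hypothetical IDs on the two marker elements
html_text = """
<div id="notes">Notes heading</div>
<p>I want this...</p>
<p>I want this too...</p>
<div id="item2">Item 2.</div>
<p>I DON'T want this...</p>"""

soup = BeautifulSoup(html_text, "html.parser")
start = soup.find(id="notes")
end = soup.find(id="item2")

between = siblings_between(start, end)
texts = [tag.get_text() for tag in between]
print(texts)
```

If the end element is not a sibling of the start element, the loop never meets it and collects everything after the start, so it is worth checking the structure of your file first.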
