
I was hoping someone could give me some suggestions.

I have a large HTML document, and I need to extract the data between two tags. It’s a dynamic document, so it can be different each time, but there are a couple of constants. The starting point of the extraction is the section that starts with "Notes to Unaudited Condensed". You can see the section ID in the table of contents:

 <a href="#a1NatureofOperations_790426"><span style="font-style:normal;font-weight:normal;">Notes to Unaudited Condensed Consolidated Financial Statements</span></a></p></td>

Basically, I want to extract all content up to the next section ID, whose table-of-contents entry always starts with "Item 2.":

 <a href="#ITEM2MANAGEMENTSDISCUSSIONANDANALYSIS_77"><span style="font-style:normal;font-weight:normal;">Item 2.</span></a></p></td>

So, is there a way for me to get the tag ID from the anchor and then search the document for that tag ID as the start / end of the span that needs to be parsed?

Or perhaps there is some other Python HTML parser that can do much of the work for me?

Thanks!

2 Answers


  1. I hope I’ve understood your question. It seems you want to extract the content of an HTML file between two tags. I recommend the BeautifulSoup library, a popular HTML parser. You can use this code as an example:

    from bs4 import BeautifulSoup, NavigableString
    
    def extract_content(html):
        soup = BeautifulSoup(html, 'html.parser')
    
        # Find the table-of-contents links for the start and end sections
        start_link = soup.find('a', {'href': '#a1NatureofOperations_790426'})
        end_link = soup.find('a', {'href': '#ITEM2MANAGEMENTSDISCUSSIONANDANALYSIS_77'})
        if not start_link or not end_link:
            return None
    
        # Strip the leading '#' to get the IDs the links point to
        start_id = start_link['href'][1:]
        end_id = end_link['href'][1:]
    
        # Locate the elements that carry those IDs (any tag, not only <div>)
        start_el = soup.find(id=start_id)
        end_el = soup.find(id=end_id)
        if not start_el or not end_el:
            return None
    
        # Collect every text node from the start element up to the end element
        parts = []
        for node in start_el.next_elements:
            if node is end_el:
                break
            if isinstance(node, NavigableString):
                parts.append(str(node))
        return ''.join(parts)
    
    # Example usage
    with open('your_html_file.html', 'r') as file:
        html_content = file.read()
    
    result = extract_content(html_content)
    
    if result:
        print(result)
    else:
        print("Content extraction failed.")
    
    

    This script assumes that the elements the table-of-contents links point to carry matching id attributes. Adjust the tags and attributes to match your HTML structure.
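
    Since the document is dynamic, the two hrefs hardcoded above may change from file to file. Here is a minimal sketch of looking them up by link text instead, assuming the table-of-contents wording ("Notes to Unaudited Condensed" and "Item 2.") stays constant as described in the question. The helper name `find_section_id` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

def find_section_id(soup, link_text):
    """Return the fragment ID targeted by the first in-page link whose text contains link_text."""
    for link in soup.find_all('a', href=True):
        # Only consider in-page links such as href="#some_id"
        if link['href'].startswith('#') and link_text in link.get_text():
            return link['href'][1:]
    return None

# Hypothetical table-of-contents fragment modelled on the question
html = """
<p><a href="#a1NatureofOperations_790426"><span>Notes to Unaudited Condensed Consolidated Financial Statements</span></a></p>
<p><a href="#ITEM2MANAGEMENTSDISCUSSIONANDANALYSIS_77"><span>Item 2.</span></a></p>
"""

soup = BeautifulSoup(html, 'html.parser')
start_id = find_section_id(soup, 'Notes to Unaudited Condensed')
end_id = find_section_id(soup, 'Item 2.')
print(start_id, end_id)
```

    The returned IDs can then be used in place of the hardcoded values when searching for the start and end elements.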

    Good luck! 😀

  2. Here is one possible example of how you can extract the tags between two tags:

    from bs4 import BeautifulSoup
    
    html_text = """
    <div>
        Something other ...
    </div>
    <div>
        <a href="#"><span>Notes to Unaudited Condensed Consolidated Financial Statements</span></a>
    </div>
    <div>I want this...</div>
    <div>I want this too...</div>
    <div>
        <a href="#"><span>Item 2.</span></a>
    </div>
    <div>I DON'T want this...</div>"""
    
    soup = BeautifulSoup(html_text, "html.parser")
    
    tag_start = soup.find(
        lambda tag: "Notes to Unaudited Condensed Consolidated Financial Statements"
        in tag.text,
        recursive=False,
    )
    
    tag_end = soup.find(
        lambda tag: "Item 2." in tag.text,
        recursive=False,
    )
    
    tags_in_between, state = [], False
    for tag in soup.find_all(recursive=False):
        if tag is tag_start:
            state = True
        elif tag is tag_end:
            state = False
        elif state:
            tags_in_between.append(tag)
    
    print(tags_in_between)
    

    Prints:

    [<div>I want this...</div>, <div>I want this too...</div>]
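
    The same selection can also be written without the state flag, by walking the siblings that follow the start tag and stopping at the end tag. This is only a sketch under the same assumption as above, namely that both marker tags sit at the top level of the document. The helper name `top_level_tag_containing` is made up for illustration.

```python
from bs4 import BeautifulSoup

html_text = """
<div>Something other ...</div>
<div><a href="#"><span>Notes to Unaudited Condensed Consolidated Financial Statements</span></a></div>
<div>I want this...</div>
<div>I want this too...</div>
<div><a href="#"><span>Item 2.</span></a></div>
<div>I DON'T want this...</div>"""

soup = BeautifulSoup(html_text, "html.parser")

def top_level_tag_containing(soup, text):
    # Hypothetical helper: first top-level tag whose text contains `text`
    return soup.find(lambda tag: text in tag.get_text(), recursive=False)

tag_start = top_level_tag_containing(soup, "Notes to Unaudited Condensed")
tag_end = top_level_tag_containing(soup, "Item 2.")

tags_in_between = []
# find_next_siblings(True) yields only the Tag siblings after tag_start
for sibling in tag_start.find_next_siblings(True):
    if sibling is tag_end:
        break
    tags_in_between.append(sibling)

print([tag.get_text() for tag in tags_in_between])
```

    If the markers are nested at different depths, this sibling walk will not find the end tag, and iterating over `next_elements` from the start tag is the more general option.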
    