I was hoping someone could give me some suggestions.
So, I have a large HTML document, and I need to extract the data between 2 tags. It's a dynamic document, so it can be different each time. But there are a couple of constants. The starting point of the extraction will be the section that starts with "Notes to Unaudited Condensed". You can see the section ID from the table of contents:
<a href="#a1NatureofOperations_790426"><span style="font-style:normal;font-weight:normal;">Notes to Unaudited Condensed Consolidated Financial Statements</span></a></p></td>
Basically I want to extract all content up until the next section ID, which always starts with "Item 2.":
<a href="#ITEM2MANAGEMENTSDISCUSSIONANDANALYSIS_77"><span style="font-style:normal;font-weight:normal;">Item 2.</span></a></p></td>
So, is there a way for me to get the tag ID from the anchor, and then I can search the document for that tag ID as the start / end of the parsing that is needed?
Or, perhaps there is some other Python HTML parser which can do much of the work for me?
Thanks!
2 Answers
I hope I've understood your question. It seems you want to extract the content of an HTML file between two tags. I recommend the BeautifulSoup library, which is a popular HTML parser. You can use this code as an example:
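A minimal sketch of that idea, not a definitive implementation: it reads the section IDs from the table-of-contents links, finds the elements carrying those IDs, and collects everything in between. The sample HTML, the helper names `anchor_target_id` and `extract_between`, and the assumption that the start and end elements are siblings in the same parent are all illustrative and not taken from the real document.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the real filing.
SAMPLE_HTML = """
<table>
<tr><td><p><a href="#a1NatureofOperations_790426"><span>Notes to Unaudited Condensed Consolidated Financial Statements</span></a></p></td></tr>
<tr><td><p><a href="#ITEM2MANAGEMENTSDISCUSSIONANDANALYSIS_77"><span>Item 2.</span></a></p></td></tr>
</table>
<div id="a1NatureofOperations_790426">1. Nature of Operations</div>
<p>First note paragraph.</p>
<p>Second note paragraph.</p>
<div id="ITEM2MANAGEMENTSDISCUSSIONANDANALYSIS_77">Item 2. Discussion and Analysis</div>
<p>Not wanted.</p>
"""


def anchor_target_id(soup, link_text):
    """Return the fragment ID of the first in-page link whose text contains link_text."""
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("#") and link_text in a.get_text():
            return a["href"][1:]  # drop the leading '#'
    return None


def extract_between(html, start_text, end_text):
    """Collect the text of the start element and of its siblings up to the end element."""
    soup = BeautifulSoup(html, "html.parser")
    start_id = anchor_target_id(soup, start_text)
    end_id = anchor_target_id(soup, end_text)
    if start_id is None or end_id is None:
        return []
    start = soup.find(id=start_id)
    end = soup.find(id=end_id)
    if start is None or end is None:
        return []
    parts = [start.get_text(" ", strip=True)]
    # Assumption: both elements share one parent, so walking siblings is enough.
    for sibling in start.find_next_siblings():
        if sibling is end:
            break
        parts.append(sibling.get_text(" ", strip=True))
    return parts


sections = extract_between(SAMPLE_HTML, "Notes to Unaudited Condensed", "Item 2.")
print(sections)
```

If the two ID-bearing elements are nested at different depths in the real document, the sibling walk would need to be replaced by a walk over all following elements.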
This script assumes that your content is within <div> tags with the specified IDs. Adjust the tags and attributes accordingly based on your HTML structure. Good luck! 😀
Here is one possible example of how you can extract the tags between two tags:
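As a rough sketch, assuming only the standard library is available, the same extraction can be done with `html.parser`. The class names `TocLinks` and `BetweenIds` and the sample HTML are hypothetical. The first parser maps table-of-contents links to section IDs, and the second collects the text found between the element with the start ID and the element with the end ID.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified stand-in for the real filing.
SAMPLE_HTML = """
<p><a href="#a1NatureofOperations_790426"><span>Notes to Unaudited Condensed Consolidated Financial Statements</span></a></p>
<p><a href="#ITEM2MANAGEMENTSDISCUSSIONANDANALYSIS_77"><span>Item 2.</span></a></p>
<div id="a1NatureofOperations_790426">1. Nature of Operations</div>
<p>First note paragraph.</p>
<p>Second note paragraph.</p>
<div id="ITEM2MANAGEMENTSDISCUSSIONANDANALYSIS_77">Item 2. Discussion and Analysis</div>
<p>Not wanted.</p>
"""


class TocLinks(HTMLParser):
    """Collect (fragment ID, link text) pairs for every in-page link."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith("#"):
            self._href, self._text = href[1:], []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


class BetweenIds(HTMLParser):
    """Collect text from the element with start_id up to the element with end_id."""

    def __init__(self, start_id, end_id):
        super().__init__()
        self.start_id, self.end_id = start_id, end_id
        self.active = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id == self.start_id and not self.done:
            self.active = True
        elif element_id == self.end_id:
            self.active, self.done = False, True

    def handle_data(self, data):
        if self.active and data.strip():
            self.chunks.append(data.strip())


toc = TocLinks()
toc.feed(SAMPLE_HTML)
start_id = next(i for i, text in toc.links if text.startswith("Notes to Unaudited Condensed"))
end_id = next(i for i, text in toc.links if text.startswith("Item 2."))

extractor = BetweenIds(start_id, end_id)
extractor.feed(SAMPLE_HTML)
print(extractor.chunks)
```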
Prints: