I’d like to pull specific links from a webpage using Python. In my example below I’m viewing a Form 8-K on the SEC website that contains several links: one to a press release, but also one to the Table of Contents.
Here, I only want the links that are considered exhibits. All exhibits on any 8-K form should fall within the ‘Item 9.01. Financial Statements and Exhibits’ section.
The code below gets all links on the 8-K, but I only want the links within the exhibits section.
import requests
from bs4 import BeautifulSoup
# Provide the URL and Headers
url = "https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm"
headers = {"User-Agent":"INSERT YOUR USER AGENT INFO HERE"}
# Send a GET request to retrieve the HTML content
response = requests.get(url,headers=headers)
html_content = response.text
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
# Find all the links in the HTML
all_links = soup.find_all("a")
# Extract the URLs from the links and print them
for link in all_links:
    # Use a separate name so the page URL in `url` is not overwritten
    href = link.get("href")
    print(href)
2 Answers
I could not find any fields like class or id that I could use to filter the specific exhibit a tags. But I noticed that the exhibit URLs have the word "exhibit" in them, so filtering on that word finds all of those exhibit URLs.
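A minimal sketch of that filter, assuming the exhibit links can be recognized by "exhibit" appearing in the href. To keep it runnable without a network call, it parses a made-up HTML fragment rather than the real filing; the file names in it and the helper name is_exhibit_href are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the filing; the real page's hrefs may differ.
sample_html = """
<a href="#toc">Table of Contents</a>
<a href="https://example.com/press-release.htm">Press release</a>
<a href="sample_exhibit991.htm">Exhibit 99.1</a>
<a href="SAMPLE_EXHIBIT992.HTM">Exhibit 99.2</a>
<a>Anchor without an href</a>
"""


def is_exhibit_href(href):
    """Return True when an href exists and contains 'exhibit' in any case."""
    # BeautifulSoup passes None for <a> tags that have no href attribute.
    return href is not None and "exhibit" in href.lower()


soup = BeautifulSoup(sample_html, "html.parser")

# find_all accepts a function as an attribute filter and keeps matching tags.
exhibit_urls = [a["href"] for a in soup.find_all("a", href=is_exhibit_href)]

for exhibit_url in exhibit_urls:
    print(exhibit_url)
```

With the real page, soup would instead be built from response.text as in the question. Lower-casing the href before the check is a design choice so that differently capitalized file names still match.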
Looking at the page, you can search for all links with the word "exhibit" in their href= attribute.
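One way this search could be written is with a CSS attribute selector, sketched here under assumptions: the HTML fragment and its file names are made up for illustration, and the hrefs are assumed to be relative to the filing URL from the question, so urljoin is used to make them absolute.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Filing URL from the question, used only to resolve relative hrefs.
base_url = "https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm"

# Made-up HTML standing in for the filing; the real page's hrefs may differ.
sample_html = """
<p><a href="#toc">Table of Contents</a></p>
<p><a href="sample_pressrelease.htm">Press release</a></p>
<p><a href="sample_exhibit991.htm">Exhibit 99.1</a></p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# a[href*="exhibit"] selects <a> tags whose href contains the substring "exhibit".
exhibit_links = [
    urljoin(base_url, a["href"]) for a in soup.select('a[href*="exhibit"]')
]

for exhibit_link in exhibit_links:
    print(exhibit_link)
```

The substring match is case-sensitive here, so an href spelled with "Exhibit" or "EXHIBIT" would need its own selector or a lower-cased comparison in Python.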
Prints: