
I’d like to pull specific links from a webpage using Python. In my example below I’m viewing a Form 8-K on the SEC website that contains several links: one to a press release, but also one to the Table of Contents.

Here, I only want the links that are considered exhibits. All exhibits on any 8-K form should fall within the ‘Item 9.01. Financial Statements and Exhibits’ section.

The code below gets all links on the 8-K, but I only want the links within the Exhibit section.

import requests
from bs4 import BeautifulSoup

# Provide the URL and Headers
url = "https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm"
headers = {"User-Agent": "INSERT YOUR USER AGENT INFO HERE"}

# Send a GET request to retrieve the HTML content
response = requests.get(url, headers=headers)
html_content = response.text

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Find all the links in the HTML
all_links = soup.find_all("a")

# Extract the URLs from the links and print them
for link in all_links:
    href = link.get("href")  # separate name so the page URL in `url` is not overwritten
    print(href)


2 Answers


  1. I could not find any attribute such as class or id that would let me filter the exhibit a tags specifically.

    However, I noticed that the exhibit URLs contain the word "exhibit", so the following code finds all of them.

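    # Extract the URLs from the links and print only the exhibit ones.
    # This replaces the final loop in the question; `url` must still hold the page URL.
    base_endpoint = '/'.join(url.split('/')[:-1])
    for link in all_links:
        a_url = link.get("href")
        # Test the link's own href (not the page URL) and skip <a> tags without one
        if a_url and 'exhibit' in a_url:
            print(f'{base_endpoint}/{a_url}')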
    
  2. Looking at the page, you can search for all links with the word "exhibit" in their href:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm'
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0'}
    soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
    
    for link in soup.select('[href*="exhibit"]'):
        print(link.text)
        print(url.rsplit('/', maxsplit=1)[0] + '/' + link['href'])
        print()
    

    Prints:

    Press Release dated January 25, 2023 announcing financial results for the fiscal quarter ended December 25, 2022
    https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx_exhibitx991xq2x2023.htm
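
Both answers match on the word "exhibit" in the href, which depends on how the filer names its files. A minimal sketch of restricting the search to the section itself, as the question asks, might look like the following. It is not taken from either answer. It assumes that the heading text containing "Item 9.01" sits in a single text node and that the section ends at a "SIGNATURES" line, which may not hold for every filing. `SAMPLE_HTML` and `exhibit_links` are hypothetical names introduced for illustration, and the sample markup is a simplified stand-in, not the real filing.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup, Tag

PAGE_URL = "https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm"

# Hypothetical, simplified stand-in for an 8-K page (not the real markup)
SAMPLE_HTML = """
<html><body>
<a href="#toc">Table of Contents</a>
<p>Item 2.02. Results of Operations and Financial Condition</p>
<p>See the <a href="other.htm">earlier filing</a>.</p>
<p>Item 9.01. Financial Statements and Exhibits</p>
<table><tr><td>99.1</td>
<td><a href="lrcx_exhibitx991xq2x2023.htm">Press Release</a></td></tr></table>
<p>SIGNATURES</p>
<a href="#toc">Table of Contents</a>
</body></html>
"""


def exhibit_links(html, page_url):
    """Return absolute URLs of links between 'Item 9.01' and 'SIGNATURES'."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the heading text lives in one text node
    heading = soup.find(string=re.compile(r"Item\s+9\.01", re.I))
    if heading is None:
        return []

    links = []
    # Walk everything that follows the heading in document order
    for element in heading.next_elements:
        # Assumption: a 'SIGNATURE(S)' text node marks the end of the section
        if isinstance(element, str) and re.search(r"\bSIGNATURES?\b", element):
            break
        if isinstance(element, Tag) and element.name == "a" and element.get("href"):
            # urljoin resolves relative hrefs against the page URL
            links.append(urljoin(page_url, element["href"]))
    return links


for link in exhibit_links(SAMPLE_HTML, PAGE_URL):
    print(link)
```

For the live page, pass `response.text` and the page URL instead of the sample. Links before the heading and after the signature block, such as the Table of Contents anchors, are skipped regardless of what their href contains.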
    
    