Html - How Can I Pull Specific Links From A Webpage Using Python?

Ericander1
June 2, 2023
120 views
0 votes
2 Answers

I’d like to pull specific links from a webpage using Python. In the example below, I’m looking at a Form 8-K on the SEC website that contains several links: one to a press release, but also one to the Table of Contents.

Here, I only want the links that are considered exhibits. All exhibits on any 8-K form should fall within the ‘Item 9.01. Financial Statements and Exhibits’ section.

The code below gets all links on the 8-K, but I only want the links within the exhibits section.

```
import requests
from bs4 import BeautifulSoup

# Provide the URL and headers
url = "https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm"
headers = {"User-Agent": "INSERT YOUR USER AGENT INFO HERE"}

# Send a GET request to retrieve the HTML content
response = requests.get(url, headers=headers)
html_content = response.text

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Find all the links in the HTML
all_links = soup.find_all("a")

# Extract the href from each link and print it
# (named `href` so the page `url` above is not overwritten)
for link in all_links:
    href = link.get("href")
    print(href)
```

Answers

- DDt
- June 2, 2023 at 11:51 pm
- 0 votes
I could not find any attributes such as class or id that would let me filter for the exhibit a tags specifically.

However, I noticed that the exhibit URLs contain the word "exhibit", so the following code finds all of them.
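```
# Extract the exhibit URLs from the links and print them
# (`url` is the page URL and `all_links` the result of soup.find_all("a"))
base_endpoint = '/'.join(url.split('/')[:-1])
for link in all_links:
    a_url = link.get("href")
    # Check the link's own href, not the page URL; skip <a> tags without an href
    if a_url and 'exhibit' in a_url:
        print(f'{base_endpoint}/{a_url}')
```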
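Joining the pieces by hand works here because the exhibit hrefs on this page are relative file names. As a minimal sketch, `urllib.parse.urljoin` from the standard library can do the same resolution and should also cope with absolute or root-relative hrefs. The helper name `absolute_exhibit_urls` and the sample `hrefs` list are made up for illustration; only the page URL and the exhibit file name come from this thread.

```python
from urllib.parse import urljoin

page_url = "https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm"


def absolute_exhibit_urls(hrefs, base=page_url):
    """Return absolute URLs for hrefs that contain the word 'exhibit'.

    Hypothetical helper: `hrefs` is any iterable of href values, where
    None stands for an <a> tag that has no href attribute.
    """
    return [
        urljoin(base, href)          # resolve the href against the page URL
        for href in hrefs
        if href and "exhibit" in href.lower()  # case-insensitive, skips None
    ]


# Illustrative input: one exhibit file name, one in-page anchor, one missing href
hrefs = ["lrcx_exhibitx991xq2x2023.htm", "#toc", None]
exhibit_urls = absolute_exhibit_urls(hrefs)
print(exhibit_urls)
```

With the loop above, the input would be `[link.get("href") for link in all_links]`.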

Looking at the page, you can select all links that have the word "exhibit" in their href attribute:

```
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for link in soup.select('[href*="exhibit"]'):
    print(link.text)
    print(url.rsplit('/', maxsplit=1)[0] + '/' + link['href'])
    print()
```

Prints:

Press Release dated January 25, 2023 announcing financial results for the fiscal quarter ended December 25, 2022
https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx_exhibitx991xq2x2023.htm
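Both answers rely on the file name containing "exhibit". If that naming cannot be counted on, another option is to follow the question's own rule and keep only links that appear after the "Item 9.01" heading. The following is a rough sketch of that idea, not a tested solution for every filing: it assumes the heading text sits in a single text node, that the first match is the real section heading rather than a table-of-contents entry, and that no unrelated links follow the section. The function name and the sample HTML are invented for illustration.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def exhibit_section_links(html, page_url):
    """Return absolute URLs of links that come after the 'Item 9.01' heading."""
    soup = BeautifulSoup(html, "html.parser")
    # First text node mentioning "Item 9.01"; \s+ also matches non-breaking spaces
    heading = soup.find(string=re.compile(r"Item\s+9\.01", re.IGNORECASE))
    if heading is None:
        return []
    # find_all_next walks everything after the heading in document order
    return [
        urljoin(page_url, a["href"])
        for a in heading.find_all_next("a", href=True)
    ]


# Made-up miniature of an 8-K body, only to show the behaviour
sample_html = """
<p><a href="#toc">Table of Contents</a></p>
<p>Item 2.02. Results of Operations and Financial Condition</p>
<p><a href="earlier.htm">A link before the exhibits section</a></p>
<p><b>Item 9.01. Financial Statements and Exhibits</b></p>
<table><tr><td><a href="ex99-1.htm">Press Release</a></td></tr></table>
"""

links = exhibit_section_links(sample_html, "https://example.com/filing/form8k.htm")
print(links)
```

For the real page, pass `response.text` and the filing URL instead of the sample values.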
