I am currently working on a project to improve my Python knowledge. I am trying to use BeautifulSoup to find specific data held in a <p> element with no class in the website's text.
I have pasted the modules I am using, as well as the section of code I'm having trouble fixing, below.
Any help is appreciated!
import requests
from bs4 import BeautifulSoup
import csv
import re

mitigation = []
for id in id_list:
    page = requests.get(f"https://attack.mitre.org/techniques/{id2}")
    soup = BeautifulSoup(page.content, 'html.parser')
    paragraph = soup.find('p', class_ = '')
    status_code = page.status_code
    mitigation.append(paragraph)
I have attempted to use:
paragraph = soup.select('p')[4].text
instead of:
paragraph = soup.find('p', class_ = '')
in order to find the correct <p> element.
2 Answers
Here, the header to search for can be either a regex or plain text, and this should work.
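A minimal sketch of that idea, assuming the paragraph you want directly follows a heading. The sample HTML, the helper name `paragraph_after_header` and the choice of `h2` are assumptions made for illustration and are not taken from the real page:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page structure may differ.
SAMPLE_HTML = """
<h2>Procedure Examples</h2>
<p>Some unrelated text.</p>
<h2>Mitigations</h2>
<p>Restrict access to the resource.</p>
"""


def paragraph_after_header(soup, header, tag="h2"):
    """Return the text of the first <p> after the heading matching `header`.

    `header` may be a plain string (exact match) or a compiled regex.
    Returns None if no heading or no following paragraph is found.
    """
    heading = soup.find(tag, string=header)
    if heading is None:
        return None
    paragraph = heading.find_next("p")
    return paragraph.get_text(strip=True) if paragraph else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
by_text = paragraph_after_header(soup, "Mitigations")
by_regex = paragraph_after_header(soup, re.compile(r"mitigation", re.I))
```

Returning None instead of raising lets the scraping loop skip pages that have no such heading.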
The question could use a little more clarity to allow for a good, holistic solution, which is why it is only addressed selectively here.
In my view, the approaches in @Barmar's comments would have been correct ways to solve the problem, given the focus of the question.
However, in order to pick out specific content, let's step away from the <p> without a class and look at the bigger picture. What other context can we use to locate this specific content? Adapt your selection so that it relies on the HTML structure and on unique, less dynamic attributes.
You are looking for mitigations, so simply select this area using its id and proceed from there with the next steps, using css selectors for chaining.
If you need it from multiple rows, use select() instead of select_one() and iterate over its result set.
Example
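A minimal sketch of this approach, run against inline sample HTML rather than the live site. The markup below (an element with `id="mitigations"` followed by a `div` holding a table) is an assumption made for illustration, so check the real structure in your browser's developer tools and adjust the selector accordingly:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed for illustration; verify the real page
# structure before relying on this selector.
SAMPLE_HTML = """
<h2 id="mitigations">Mitigations</h2>
<div>
  <table>
    <tbody>
      <tr><td>ID-1</td><td>First</td><td><p>Describe the first mitigation.</p></td></tr>
      <tr><td>ID-2</td><td>Second</td><td><p>Describe the second mitigation.</p></td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Chain from the id to the table that follows it, then down to the <p>.
SELECTOR = "#mitigations + div td p"

# select_one(): the first matching paragraph, or None when nothing matches.
first = soup.select_one(SELECTOR)
first_text = first.get_text(strip=True) if first else None

# select(): every matching paragraph, one per table row in this sample.
all_texts = [p.get_text(strip=True) for p in soup.select(SELECTOR)]
```

Anchoring on the id keeps the selection stable even if the number or order of `<p>` elements elsewhere on the page changes, which is what makes an index such as `soup.select('p')[4]` fragile.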