Html - How to use beautifulsoup4 to find <p> with no class?

TrashThrash
January 20, 2024
346 views
0 votes
2 Answers

I am currently doing a project to improve my Python knowledge. I am trying to use BeautifulSoup to extract specific data held in a <p> with no class from a website’s text.

I have pasted the modules I am using below, along with the section of code that I’m having trouble fixing.

Any help is appreciated!

import requests
from bs4 import BeautifulSoup
import csv
import re

mitigation = []

for id in id_list:
    page = requests.get(f"https://attack.mitre.org/techniques/{id}")
    soup = BeautifulSoup(page.content, 'html.parser')
    paragraph = soup.find('p', class_ = '')
    status_code = page.status_code
    mitigation.append(paragraph)

I have attempted to use:

paragraph = soup.select('p')[4].text

instead of:

paragraph = soup.find('p', class_ = '')

in order to find the correct <p>

Answers

- ArunbhYashaswi
- January 20, 2024 at 10:00 am
- 0 votes
```
for id in id_list:
    page = requests.get(f"https://attack.mitre.org/techniques/{id}")
    soup = BeautifulSoup(page.content, 'html.parser')
    # string= is the current name of the deprecated text= argument
    specific_header = soup.find('h2', string=re.compile('header To be searched'))
    if specific_header:
        paragraph = specific_header.find_next('p')
        mitigation.append(paragraph.text if paragraph else None)
    else:
        mitigation.append(None)
```
Here, the header to search for can be given either as a regex or as plain text, and this should work.
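To make the idea easier to try without hitting the site, here is a minimal sketch of the same header-then-paragraph lookup on inline HTML. The markup and the "Mitigations" heading text are made up for illustration and are not taken from the real page. It uses `string=`, which is the current name for the argument that older BeautifulSoup code passes as `text=`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<h2>Overview</h2>
<p>Intro text.</p>
<h2>Mitigations</h2>
<p>Block unknown modules.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# string= accepts a plain string or a compiled regex and matches the tag's text.
header = soup.find("h2", string=re.compile(r"Mitigations"))

# find_next() returns the first matching tag after the header in document order.
paragraph = header.find_next("p") if header is not None else None
mitigation_text = paragraph.get_text(strip=True) if paragraph is not None else None

print(mitigation_text)
```

If the header is missing, `mitigation_text` ends up as `None` instead of raising an error, which matches the `else` branch in the loop above.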

- HedgeHog
- January 20, 2024 at 10:41 am
- 0 votes
The question would need more detail to allow a complete solution, so this answer only addresses part of it.

In my view, @Barmar’s comments would have been correct approaches to solving the problem, given the focus of the question.
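The comments mentioned above are not reproduced here. For the literal question of matching a <p> that has no class attribute, the following is a hedged sketch of two ways that should work, using made-up HTML rather than the real page: a CSS `:not([class])` selector, and a filter function built on `Tag.has_attr`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
html = """
<p class="lead">Has a class.</p>
<p>No class, first.</p>
<p class="note">Also has a class.</p>
<p>No class, second.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: <p> elements that carry no class attribute at all.
no_class_css = [p.get_text() for p in soup.select("p:not([class])")]

# Same idea with a filter function passed to find_all().
no_class_fn = [
    p.get_text()
    for p in soup.find_all(lambda tag: tag.name == "p" and not tag.has_attr("class"))
]

print(no_class_css)
print(no_class_fn)
```

Either way you get every class-less <p> on the page, so you still have to pick the right one by position, which is brittle.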

However, to pick out one specific piece of content, let’s step back from the <p> without a class and look at the bigger picture. What other context can we use to locate this content? Base your selection on the HTML structure and on attributes that are unique and unlikely to change.

You are looking for mitigations, so select that section by its id and work from there. CSS selectors are used here for chaining:
```
soup.select_one('#mitigations + div table p')
```
If you need the text from multiple rows, use select() instead of select_one() and iterate over its ResultSet.
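As a rough sketch of the difference between the two calls, the snippet below runs the same selector against inline HTML. The markup is an assumption that only mirrors what the selector expects (an element with id="mitigations" directly followed by a div that contains a table); the row contents are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to fit the selector; not copied from the real page.
html = """
<h2 id="mitigations">Mitigations</h2>
<div>
  <table>
    <tr><td>Row A</td><td><p>First mitigation text.</p></td></tr>
    <tr><td>Row B</td><td><p>Second mitigation text.</p></td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
selector = "#mitigations + div table p"

# select_one() returns only the first match (or None if nothing matches).
first = soup.select_one(selector).get_text(strip=True)

# select() returns every match; iterate to collect the text of each row.
all_rows = [p.get_text(strip=True) for p in soup.select(selector)]

print(first)
print(all_rows)
```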

Example
```
from bs4 import BeautifulSoup
import requests

# Collects one mitigation text per technique page
mitigation = []

for id in ['T1588/001/','T1129/']:
    page = requests.get(f"https://attack.mitre.org/techniques/{id}")
    soup = BeautifulSoup(page.content, 'html.parser')
    paragraph = soup.select_one('#mitigations + div table p')
    mitigation.append(paragraph.text)

mitigation
```
```
['This technique cannot be easily mitigated with preventive controls since it is based on behaviors performed outside of the scope of enterprise defenses and controls.','Identify and block potentially malicious software executed through this technique by using application control tools capable of preventing unknown modules from being loaded.']
```
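One thing to keep in mind: select_one() returns None when nothing matches, so paragraph.text would raise an AttributeError on a page without a mitigations table. Below is a minimal sketch of a guard for that case. The helper name `first_mitigation` and the test markup are hypothetical; with real pages you would pass in `page.text` from the request.

```python
from typing import Optional

from bs4 import BeautifulSoup


def first_mitigation(html: str) -> Optional[str]:
    """Return the first mitigation paragraph's text, or None if the selector finds nothing."""
    soup = BeautifulSoup(html, "html.parser")
    paragraph = soup.select_one("#mitigations + div table p")
    return paragraph.get_text(strip=True) if paragraph is not None else None


# Hypothetical inputs: one with the expected structure, one without a table.
with_table = (
    '<h2 id="mitigations">Mitigations</h2>'
    "<div><table><tr><td><p>Example text.</p></td></tr></table></div>"
)
without_table = '<h2 id="mitigations">Mitigations</h2><p>No table here.</p>'

print(first_mitigation(with_table))
print(first_mitigation(without_table))
```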