
I am currently working on a project to improve my Python knowledge. I am trying to use BeautifulSoup to extract specific text held in a <p> tag that has no class attribute on a web page.

I have pasted the modules I am using as well as the section of code that I’m having trouble fixing at the bottom.

[Image: snippet of the text I'm trying to parse]
Any help is appreciated!

[Image: snippet of the HTML]

import requests
from bs4 import BeautifulSoup
import csv
import re

mitigation = []

for id in id_list:
    page = requests.get(f"https://attack.mitre.org/techniques/{id}")
    soup = BeautifulSoup(page.content, 'html.parser')
    paragraph = soup.find('p', class_ = '')
    status_code = page.status_code
    mitigation.append(paragraph)

I have attempted to use:

paragraph = soup.select('p')[4].text

instead of:

paragraph = soup.find('p', class_ = '')

in order to find the correct <p>

2 Answers


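  1. Find the heading first, then take the first <p> that follows it:

    for id in id_list:
        # use the loop variable in the URL
        page = requests.get(f"https://attack.mitre.org/techniques/{id}")
        soup = BeautifulSoup(page.content, 'html.parser')
        # string= replaces the deprecated text= argument (bs4 4.4.0+)
        specific_header = soup.find('h2', string=re.compile('header To be searched'))
        if specific_header:
            # first <p> that appears after the heading in document order
            paragraph = specific_header.find_next('p')
            mitigation.append(paragraph.text if paragraph else None)
        else:
            mitigation.append(None)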
    

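    Replace 'header To be searched' with the text of the heading that precedes your paragraph. The pattern can be a regex or a plain string. find_next('p') then returns the first <p> after that heading, or None if there is none.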
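
The heading-based lookup in this answer can be tried offline. The following is a minimal sketch, assuming made-up HTML (`SAMPLE_HTML`) and a hypothetical helper name (`first_paragraph_after_heading`); the real page markup may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a technique page; the real markup may differ.
SAMPLE_HTML = """
<h2>Procedure Examples</h2>
<p>Example text.</p>
<h2 id="mitigations">Mitigations</h2>
<div><p>Use application control.</p></div>
"""


def first_paragraph_after_heading(html, heading_pattern):
    """Return the text of the first <p> after an <h2> matching heading_pattern, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # string= matches against the heading's text; a compiled regex is searched, not fully matched
    heading = soup.find("h2", string=re.compile(heading_pattern))
    if heading is None:
        return None
    # find_next() walks forward in document order, so nested <p> tags are found too
    paragraph = heading.find_next("p")
    return paragraph.get_text(strip=True) if paragraph else None


print(first_paragraph_after_heading(SAMPLE_HTML, r"Mitigations"))
```

Returning None for a missing heading keeps the result list aligned with the list of requested pages, which is the same choice the loop above makes.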

  2. The question would need more detail to allow a complete solution, so this answer addresses only part of it.

    In my view, @Barmar’s comments would have been correct approaches to solving the problem, given the focus of the question.

    However, to pick out specific content, let’s step back from the <p> without a class and look at the bigger picture. What other context can we use to locate this content? Base your selection on the HTML structure and on unique, less dynamic attributes.

    You are looking for mitigations, so select that section by its id and work from there. CSS selectors are used here for chaining:

    soup.select_one('#mitigations + div table p')
    

    If you need the text from multiple rows, use select() instead of select_one() and iterate over its ResultSet.

    Example
    from bs4 import BeautifulSoup
    import requests
    
    mitigation = []
    
    for id in ['T1588/001/','T1129/']:
        page = requests.get(f"https://attack.mitre.org/techniques/{id}")
        soup = BeautifulSoup(page.content, 'html.parser')
        # first <p> in the table inside the <div> that directly follows id="mitigations"
        paragraph = soup.select_one('#mitigations + div table p')
        # select_one() returns None when nothing matches
        mitigation.append(paragraph.text if paragraph else None)
    
    mitigation
    

    ['This technique cannot be easily mitigated with preventive controls since it is based on behaviors performed outside of the scope of enterprise defenses and controls.','Identify and block potentially malicious software executed through this technique by using application control tools capable of preventing unknown modules from being loaded.']
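
The select() variant mentioned in this answer could look like the sketch below. It runs against made-up HTML so that no network access is needed; `SAMPLE_HTML` and the helper name `mitigation_texts` are assumptions for illustration, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an element with id="mitigations" directly followed by a <div> holding a table.
SAMPLE_HTML = """
<h2 id="mitigations">Mitigations</h2>
<div>
  <table>
    <tr><td>First</td><td><p>Block unknown modules.</p></td></tr>
    <tr><td>Second</td><td><p>Monitor module loads.</p></td></tr>
  </table>
</div>
<p>Unrelated paragraph.</p>
"""


def mitigation_texts(html):
    """Return the text of every <p> inside the table that follows #mitigations."""
    soup = BeautifulSoup(html, "html.parser")
    # "+" selects the <div> that is the next sibling element of #mitigations;
    # "table p" then matches every <p> nested anywhere inside that div's table
    return [p.get_text(strip=True) for p in soup.select("#mitigations + div table p")]


print(mitigation_texts(SAMPLE_HTML))
```

Unlike select_one(), select() returns an empty list when nothing matches, so no None check is needed before iterating.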
    