I’m trying to scrape specific details from a page as a list, using BeautifulSoup in Python.

<p class="collapse text in" id="list_2">
    <big>•</big>
    &nbsp;car
    <br>
    <big>•</big>
    &nbsp;bike&nbsp;
    <br> 
    <span id="list_hidden_2" class="inline_hidden collapse in" aria-expanded="true">
        <big>•</big>
        &nbsp;bus
        <br>
        <big>•</big>
        &nbsp;train
        <br><br> 
    </span>
    <span>...</span>
    <a data-id="list" href="#list_hidden_2" class="link_sm link_toggle" data-toggle="collapse"
        aria-expanded="true"></a>
</p>

I need a list containing every text item inside the <p>, like this:

list = ['car', 'bike', 'bus', 'train']

This is my code so far:

from bs4 import BeautifulSoup
import requests

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

p_tag = soup.find("p", {"id":"list_2"})
list = p_tag.text.strip()
print(list)

output:

• car• bike
• bus• train

How can I convert this into a list like list = ['car', 'bike', 'bus', 'train']?

2 Answers


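  1. Note: Avoid using the names of Python built-ins such as list for your own variables. list is not a reserved keyword, so the assignment is allowed, but it shadows the built-in, and any later call such as list(...) then fails with a TypeError.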


    There are several ways to reach your goal. I would recommend working on your strategy for selecting the elements: select all <big> tags first and then pick the next_sibling of each:

    [e.next_sibling.get_text(strip=True) for e in soup.select('big')]
    
    Example
    from bs4 import BeautifulSoup
    html = '''
    <p class="collapse text in" id="list_2">
        <big>•</big>
        &nbsp;car
        <br>
        <big>•</big>
        &nbsp;bike&nbsp;
        <br> 
        <span id="list_hidden_2" class="inline_hidden collapse in" aria-expanded="true">
            <big>•</big>
            &nbsp;bus
            <br>
            <big>•</big>
            &nbsp;train
            <br><br> 
        </span>
        <span>...</span>
        <a data-id="list" href="#list_hidden_2" class="link_sm link_toggle" data-toggle="collapse"
            aria-expanded="true"></a>
    </p>
    '''
    # Name the parser explicitly; omitting it triggers a GuessedAtParserWarning.
    soup = BeautifulSoup(html, 'html.parser')
    
    item_list = [e.next_sibling.get_text(strip=True) for e in soup.select('big')]
    print(item_list)
    
    Output
    ['car', 'bike', 'bus', 'train']
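
On a full page, soup.select('big') would also pick up <big> tags outside the target paragraph. The sketch below is one possible way to narrow the selection to #list_2 and to skip siblings that are not text. The shortened markup and the extra "unrelated" paragraph are assumptions added for illustration and are not part of the original page.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the markup from the question, plus a second
# paragraph (an assumption for illustration) that must be ignored.
html = '''
<p class="collapse text in" id="list_2">
    <big>•</big>&nbsp;car<br>
    <big>•</big>&nbsp;bike&nbsp;<br>
    <span id="list_hidden_2" class="inline_hidden collapse in">
        <big>•</big>&nbsp;bus<br>
        <big>•</big>&nbsp;train<br><br>
    </span>
    <span>...</span>
</p>
<p><big>•</big>&nbsp;unrelated</p>
'''

soup = BeautifulSoup(html, 'html.parser')

item_list = []
# Only look at <big> tags inside the paragraph with id="list_2".
for big in soup.select('#list_2 big'):
    sibling = big.next_sibling
    # Keep the sibling only if it is a text node with visible content.
    if isinstance(sibling, NavigableString) and sibling.strip():
        item_list.append(sibling.strip())

print(item_list)
```

The isinstance check is a guard for pages where a <big> is followed directly by another tag or by nothing at all.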
    
  2. I had a similar idea to @HedgeHog's. An alternative is to select the <br> tags and take the text just before each one using previous_sibling.

    You can modify your scraping code as follows:

    from bs4 import BeautifulSoup
    import requests
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    
    p_tag = soup.find("p", {"id":"list_2"})
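    # The doubled <br><br> makes one <br> the previous sibling of the other,
    # which yields an empty string, so empty results are filtered out.
    texts = (br.previous_sibling.text.strip() for br in p_tag.select('br'))
    output = [t for t in texts if t]
    print(*output)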
    

    Output:

    car bike bus train 
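
If you would rather keep the original p_tag.text approach, the resulting string can also be split on the bullet character. This is only a sketch, assuming the text looks like the output shown in the question; split_bullets is a hypothetical helper name, not part of BeautifulSoup.

```python
def split_bullets(text, bullet='•'):
    """Split bullet-separated text into a list of clean items."""
    # Strip whitespace (including non-breaking spaces) and the "..."
    # filler from both ends of every piece.
    parts = (part.strip(' \t\r\n\xa0.') for part in text.split(bullet))
    # Drop the empty piece that precedes the first bullet.
    return [part for part in parts if part]


raw = '• car• bike\n• bus• train'
items = split_bullets(raw)
print(items)  # ['car', 'bike', 'bus', 'train']
```

Note that this also removes leading and trailing periods from each item, so it would not suit entries that legitimately end in a period.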
    