How can I web scrape specific details from an HTML tag?

NikhilV
May 9, 2023
219 views
0 votes
2 Answers

I’m trying to scrape specific details from a page into a list using BeautifulSoup in Python.

<p class="collapse text in" id="list_2">
    <big>•</big>
    &nbsp;car
    <br>
    <big>•</big>
    &nbsp;bike&nbsp;
    <br> 
    <span id="list_hidden_2" class="inline_hidden collapse in" aria-expanded="true">
        <big>•</big>
        &nbsp;bus
        <br>
        <big>•</big>
        &nbsp;train
        <br><br> 
    </span>
    <span>...</span>
    <a data-id="list" href="#list_hidden_2" class="link_sm link_toggle" data-toggle="collapse"
        aria-expanded="true"></a>
</p>

I need a list containing each piece of text inside the <p>, like this:

list = ['car', 'bike', 'bus', 'train']

from bs4 import BeautifulSoup
import requests

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

p_tag = soup.find("p", {"id":"list_2"})
list = p_tag.text.strip()
print(list)

output:

• car• bike
• bus• train

How can I convert this into a list like list = ['car', 'bike', 'bus', 'train']?

Answers

Note: Avoid using Python built-in names such as list as variable names. Doing so shadows the built-in and can cause unexpected errors later in your code.
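To illustrate the note above, here is a minimal sketch of what goes wrong. `list` is a built-in name rather than a reserved keyword, so Python accepts the assignment, but any later call to `list(...)` in the same scope fails. The function name `demo_shadowing` and the variable names are made up for this illustration.

```python
import keyword

def demo_shadowing():
    # Rebinding `list` inside this function hides the built-in type here.
    list = "car bike"
    try:
        list("abc")  # now calls a str, not the built-in list type
    except TypeError as exc:
        return str(exc)
    return None

# `list` is not a keyword, which is why the assignment is legal at all.
print(keyword.iskeyword("list"))

shadow_error = demo_shadowing()
print(shadow_error)

# Outside the function the built-in is untouched and works as usual.
item_list = list("abc")
print(item_list)
```

Using a different name such as `item_list` avoids the problem entirely.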

There are several ways to achieve this. I would recommend refining how you select the elements: select all <big> tags first, then read the next_sibling of each:

[e.next_sibling.get_text(strip=True) for e in soup.select('big')]

Example

from bs4 import BeautifulSoup
html = '''
<p class="collapse text in" id="list_2">
    <big>•</big>
    &nbsp;car
    <br>
    <big>•</big>
    &nbsp;bike&nbsp;
    <br> 
    <span id="list_hidden_2" class="inline_hidden collapse in" aria-expanded="true">
        <big>•</big>
        &nbsp;bus
        <br>
        <big>•</big>
        &nbsp;train
        <br><br> 
    </span>
    <span>...</span>
    <a data-id="list" href="#list_hidden_2" class="link_sm link_toggle" data-toggle="collapse"
        aria-expanded="true"></a>
</p>
'''
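# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')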

item_list = [e.next_sibling.get_text(strip=True) for e in soup.select('big')]
print(item_list)

Output

['car', 'bike', 'bus', 'train']
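On a real page there may be other <big> tags outside the target paragraph. As a hedged variation on the example above, the sketch below limits the selection to the <p id="list_2"> element with a CSS selector and skips any <big> whose next sibling is not a plain text node. The extra "unrelated" bullet and the trimmed markup are assumptions added only to show the scoping.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<big>•</big>&nbsp;unrelated
<p id="list_2">
    <big>•</big>&nbsp;car<br>
    <big>•</big>&nbsp;bike&nbsp;<br>
    <span id="list_hidden_2">
        <big>•</big>&nbsp;bus<br>
        <big>•</big>&nbsp;train<br><br>
    </span>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# Only <big> tags inside <p id="list_2"> are matched, including those in the nested <span>.
scoped_items = [
    big.next_sibling.strip()  # str.strip() also removes the non-breaking spaces
    for big in soup.select('p#list_2 big')
    if isinstance(big.next_sibling, NavigableString)
]
print(scoped_items)
```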

- AmishaKirti
- May 9, 2023 at 10:22 am
- 0 votes
I had the same idea as @HedgeHog. An alternative is to select the <br> tags and read the text just before each one using previous_sibling.

You can modify your scraping code as follows:
```
from bs4 import BeautifulSoup
import requests

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

p_tag = soup.find("p", {"id":"list_2"})
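# Text just before each <br>; the second <br> in "<br><br>" has none, so drop empty strings
texts = [br.previous_sibling.text.strip() for br in p_tag.select('br')]
output = [t for t in texts if t]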
print(*output)
```
Output:
```
car bike bus train 
```
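The snippet above depends on a live `url`, so here is a self-contained sketch of the same previous_sibling idea run against a trimmed copy of the HTML from the question. It assumes the markup keeps this shape; the name `br_items` is made up for this illustration. It collects the items into a list and only accepts text nodes, which skips the second <br> in "<br><br>" because its previous sibling is another tag.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<p class="collapse text in" id="list_2">
    <big>•</big>&nbsp;car<br>
    <big>•</big>&nbsp;bike&nbsp;<br>
    <span id="list_hidden_2" class="inline_hidden collapse in">
        <big>•</big>&nbsp;bus<br>
        <big>•</big>&nbsp;train<br><br>
    </span>
    <span>...</span>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')
p_tag = soup.find('p', {'id': 'list_2'})

br_items = []
for br in p_tag.select('br'):
    prev = br.previous_sibling
    # Keep only non-empty text nodes that sit directly before a <br>.
    if isinstance(prev, NavigableString) and prev.strip():
        br_items.append(prev.strip())

print(br_items)
```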