Extract a word from a line of HTML

Class1
November 23, 2023
165 views
0 votes
2 Answers

I have the following line from an HTML file.

src="https://www.com/seek-images/seek-icons/horoskop-*pluto.gif*" alt="" />*Pluto*</div>\n

Using Python and regex, how do I extract Pluto (or whatever planet name appears there) from the above line?

I tried a regex with [\w]+ and

match = re.search(regexeg, line)
if match:
    print("begin")
    print(match.groups())
    for group in match.groups():
        print('{} '.format(group))

The group is empty. I appreciate your help.

Answers

You can use BeautifulSoup to parse the HTML.

First, install Beautiful Soup:

pip install beautifulsoup4

Then parse the HTML and extract the text from the div element, something like this:

from bs4 import BeautifulSoup

html_line = '<div> <img src="https://www.com/seek-images/seek-icons/horoskop-*pluto.gif*" alt="" />*Pluto*</div>\n'

# Parse the HTML line with BeautifulSoup
soup = BeautifulSoup(html_line, 'html.parser')

# Find the div tag containing the planet name
div_tag = soup.find('div')

if div_tag:
    planet_name = div_tag.get_text(strip=True)
    print("Planet Name:", planet_name)
else:
    print("No div tag found.")
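
If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only an illustration, not part of the answer's BeautifulSoup code: DivTextParser is a hypothetical helper name, and stripping the asterisks assumes they are literally present in the line, as in the sample shown in the question.

```python
from html.parser import HTMLParser


class DivTextParser(HTMLParser):
    """Collect the text found inside <div> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of <div> tags currently open
        self.chunks = []   # non-blank text fragments seen inside a <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only text that appears while at least one <div> is open
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


html_line = '<div> <img src="https://www.com/seek-images/seek-icons/horoskop-*pluto.gif*" alt="" />*Pluto*</div>\n'

parser = DivTextParser()
parser.feed(html_line)
parser.close()

# Assumption: the asterisks around the name are literal characters in the input
planet_names = [chunk.strip("*") for chunk in parser.chunks]
print(planet_names)
```

For anything beyond a single line like this, BeautifulSoup or lxml will usually be less work than a hand-written parser subclass.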

- Nikolaj
- November 22, 2023 at 5:47 pm
- 0 votes
As noted in comments, thou shalt not parse HTML with regexes.

But if you really need to, here’s the simplest version for the string you’ve shown.
```
>\*([A-Za-z]+)\*<
```
See it in action.

It may well not be specific enough for a whole HTML document; for that you really need lxml or BeautifulSoup.
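
For reference, here is a minimal sketch of how such a pattern could be applied from Python, and why the attempt in the question printed an empty group. It assumes the asterisks around the planet name are literal characters in the line (so they are escaped as \*), and it uses [A-Za-z] for the letters; the names line and PLANET_RE are made up for the example.

```python
import re

line = 'src="https://www.com/seek-images/seek-icons/horoskop-*pluto.gif*" alt="" />*Pluto*</div>\n'

# Without parentheses the pattern has no capture group,
# so groups() is an empty tuple even though the search succeeds.
no_group = re.search(r"[\w]+", line)
print(no_group.groups())   # ()
print(no_group.group(0))   # src

# Parentheses create capture group 1; \* matches a literal asterisk.
PLANET_RE = re.compile(r">\*([A-Za-z]+)\*<")
match = PLANET_RE.search(line)
if match:
    print(match.group(1))  # Pluto
```

match.groups() only returns parenthesised groups; the whole match is always available as match.group(0).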