I have the following line from an HTML file.
src="https://www.com/seek-images/seek-icons/horoskop-*pluto.gif*" alt="" />*Pluto*</div>\n
Using Python and regex, how do I extract Pluto (or whatever planet name appears there) from the above line?
I tried a regex with [\w]+ and

    import re

    match = re.search(regexeg, line)
    if match:
        print("begin")
        print(match.groups())
        for group in match.groups():
            print('{} '.format(group))

but the groups are empty. I appreciate your help.
2 Answers
You can use Beautiful Soup to parse the HTML.
First install Beautiful Soup, for example with pip install beautifulsoup4.
Then parse the entire HTML and extract the text from the div element, something like this:
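A minimal sketch of that approach. It assumes the line sits inside a div with an img tag in front of the text (the opening `<div><img` is my guess, since the question shows only the tail of the line) and that the asterisks in the question are emphasis markers rather than part of the data:

```python
from bs4 import BeautifulSoup

# Assumed reconstruction of the full element; only the part from src= onward
# appears in the question.
html = (
    '<div><img src="https://www.com/seek-images/seek-icons/horoskop-pluto.gif"'
    ' alt="" />Pluto</div>\n'
)

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# Take the first div and read its text content, trimming surrounding whitespace.
div = soup.find("div")
planet = div.get_text(strip=True)
print(planet)
```

For a whole document you would probably use soup.find_all("div") with some filter (a class name, for instance) and read the text of each match.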
As noted in comments, thou shalt not parse HTML with regexes.
But if you really need to, here’s the simplest version for the string you’ve shown.
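One possible pattern, as a sketch rather than a definitive answer. The helper name extract_planet is made up for illustration, and the sample line assumes the asterisks in the question are emphasis markers, not part of the data. Note that [\w]+ on its own contains no parentheses, so match.groups() is an empty tuple even when the match succeeds; the name has to sit inside a capturing group.

```python
import re

# Capture the run of word characters between the end of the img tag and </div>.
PLANET_RE = re.compile(r'/>\s*(\w+)\s*</div>')

def extract_planet(line):
    """Return the text between '/>' and '</div>', or None if there is no match."""
    match = PLANET_RE.search(line)
    return match.group(1) if match else None

line = ('src="https://www.com/seek-images/seek-icons/horoskop-pluto.gif"'
        ' alt="" />Pluto</div>\n')

print(extract_planet(line))

# Why the original attempt printed nothing: no capturing group in the pattern.
no_groups = re.search(r'[\w]+', line)
print(no_groups.groups())
```

If the match succeeds, match.group(1) holds the name and match.groups() is a one-element tuple.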
It may well not be specific enough for a whole HTML document; for that you really need lxml or BeautifulSoup.