
I have the following line from an HTML file.

src="https://www.com/seek-images/seek-icons/horoskop-*pluto.gif*" alt="" />*Pluto*</div>\n

Using Python and a regex, how do I extract Pluto (or whatever planet name appears there) from the above line?

I tried a regex with [\w]+ and

import re

match = re.search(regexeg, line)
if match:
    print("begin")
    print(match.groups())
    for group in match.groups():
        print('{} '.format(group))

The group is empty. I appreciate your help.

2

Answers


  1. You can use BeautifulSoup to parse the HTML.

    First, install Beautiful Soup:

    pip install beautifulsoup4
    

    Then parse the HTML and extract the text from the div element, something like this:

    from bs4 import BeautifulSoup
    
    html_line = '<div> <img src="https://www.com/seek-images/seek-icons/horoskop-*pluto.gif*" alt="" />*Pluto*</div>\n'
    
    # Parse the HTML line with BeautifulSoup
    soup = BeautifulSoup(html_line, 'html.parser')
    
    # Find the div tag containing the planet name
    div_tag = soup.find('div')
    
    if div_tag:
        # strip('*') drops the literal asterisks present in the sample line
        planet_name = div_tag.get_text(strip=True).strip('*')
        print("Planet Name:", planet_name)
    else:
        print("No div tag found.")
    
  2. As noted in comments, thou shalt not parse HTML with regexes.

    But if you really need to, here’s the simplest version for the string you’ve shown.

    >\*([A-Za-z]+)\*<
    

    See it in action.

    It may well not be specific enough for a whole HTML document; for that you really need lxml or BeautifulSoup.
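
For completeness, here is a minimal sketch of how a pattern like the one above might be applied from Python to the line in the question. The variable names `line`, `pattern`, and `planet` are illustrative, and two details are assumptions rather than part of the original answer: the asterisks are made optional with `\*?` in case the real file does not contain them, and the character class is written as `[A-Za-z]` so that only letters match.

```python
import re

# The sample line from the question, with the asterisks kept exactly as shown.
line = 'src="https://www.com/seek-images/seek-icons/horoskop-*pluto.gif*" alt="" />*Pluto*</div>\n'

# The parentheses form a capturing group around the planet name.
# \*? makes each asterisk optional (assumption: the real HTML may lack them).
pattern = re.compile(r'>\*?([A-Za-z]+)\*?<')

match = pattern.search(line)
# group(1) is the text captured by the parentheses, or None if nothing matched.
planet = match.group(1) if match else None
print(planet)

# For comparison: a pattern without parentheses still matches,
# but has no groups, so groups() returns an empty tuple.
no_groups = re.search(r'[\w]+', line).groups()
print(no_groups)
```

The last two lines illustrate the likely cause of the empty result in the question: `match.groups()` only returns what capturing parentheses captured, so a pattern made of a bare character class yields an empty tuple even when it matches.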
