
I collect all the heading tags in the data with
heads = str(soup.find_all(re.compile('^h[1-6]$'))), and then collect the data between the heading tags. A portion of the source code is given below.

import bs4
import re

data = '''
<html>
<body>
<div class="mob-icon"> <span></span></div>
<nav id="nav">
<ul class="" id="menu-home-welcome-banner">
<li class="menu-item menu-item-type-custom menu-item-object-custom current-menu-parent menu-item-has-children menu-item-1778" id="menu-item-1778"> <a class="submeny-top" href="http://www.uvionicstech.com" ontouchstart="">Home</a> </li>
<!--<li id="menu-item-1785" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1785"><a href="#about" class="scroll-to-link" ontouchstart="">About</a></li>-->
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1786" id="menu-item-1786"><a class="scroll-to-link" href="#data-analytics" ontouchstart="">PRODUCTS &amp; SOLUTIONS</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1787" id="menu-item-1787"><a class="scroll-to-link" href="#artificial-intelligence" ontouchstart="">Artificial Intelligence</a></li>
<!-- <li id="menu-item-1788" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1788"><a href="#iot" class="scroll-to-link" ontouchstart="">IOT</a></li> -->
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1788" id="menu-item-1788"><a class="scroll-to-link" href="#services" ontouchstart="">All in One Place</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1789" id="menu-item-1789"><a class="scroll-to-link" href="#eco-system" ontouchstart="">PARTNERS</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1791" id="menu-item-1791"><a class="scroll-to-link" href="#contact" ontouchstart="">Contact</a></li>
<h3 class="h3 text-center">PARTNERS</h3>
<h3 class="vc_custom_heading titel-left wow" data-wow-delay="0.3s">
<span class="titel-line"></span>Artificial Intelligence                                  </h3>

<h3 class="vc_custom_heading titel-left wow " data-wow-delay="0.3s"><span class="titel-line">
</span>Everything for your Business, <small>all in one place</small>
</h3>

</ul>
</nav>
</div>

</body>
</html>
'''

searched_word = 'Artificial Intelligence'
soup = bs4.BeautifulSoup(data, 'html.parser')
results = soup.body.find_all(string=re.compile('.*{0}.*'.format(searched_word)), recursive=True)

output:

 results
['Artificial Intelligence',
 'Artificial Intelligence                                  ']

Here the first 'Artificial Intelligence' comes from the list item and the second from the heading tag. I only want the match that is inside a heading tag. How can I get only that one? Is there a way to look at the next few characters after the words Artificial Intelligence, so that the match becomes Artificial Intelligence </h3> and the list item is not considered?

2 Answers


  1. Since you only want the heading tags, you could grab those first and then search through them:

    searched_word = 'Artificial Intelligence'
    
    soup = bs4.BeautifulSoup(data, 'html.parser')
    head_tags = soup.find_all('h3')
    
    
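    results = []  # start empty so the check below works when nothing matches
    for ele in head_tags:
        if searched_word in ele.text:
            # drop the newlines inside the heading text
            results.append(ele.text.replace('\n', ''))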
    if results:
        print(results)
    else:
        print('No matches found')
    

    gave output:

    In [184]: results
    Out[184]: ['Artificial Intelligence                                  ']
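
The loop above checks only h3 tags and keeps the trailing spaces in the result. As a variant, here is a minimal sketch that covers h1 to h6 and collapses runs of whitespace. The helper name `find_headings` and the shortened HTML are made up for illustration; they are not from the question.

```python
import re
import bs4

# Shortened stand-in for the page in the question (illustrative only).
html = '''
<ul>
<li><a href="#artificial-intelligence">Artificial Intelligence</a></li>
</ul>
<h2>About Artificial Intelligence</h2>
<h3 class="titel-left">
<span class="titel-line"></span>Artificial Intelligence      </h3>
<h3>PARTNERS</h3>
'''

def find_headings(soup, word):
    """Return the whitespace-normalized text of every h1-h6 containing word."""
    matches = []
    for tag in soup.find_all(re.compile(r'^h[1-6]$')):
        # get_text() joins the text of all descendants; split/join collapses
        # newlines and repeated spaces into single spaces.
        text = ' '.join(tag.get_text().split())
        if word in text:
            matches.append(text)
    return matches

soup = bs4.BeautifulSoup(html, 'html.parser')
found = find_headings(soup, 'Artificial Intelligence')
print(found)
```

The list item is never inspected because only heading tags are iterated, so the link text in the menu cannot end up in the result.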
    
  2. If the headings have no child tags, like

    <h3 class="vc_custom_heading">Artificial Intelligence</h3>
    

    you can combine your heading regex with a string filter:

    results = soup.body.find_all(re.compile('^h[1-6]$'), 
                                 string=re.compile(searched_word))
    

    But your h3 contains a child tag, so I would either loop as in the first answer or write a custom function and pass it to find_all():

    def head_contain_word(tag):
        # truthy for h1-h6 tags whose combined text contains the word
        return (re.match(r'^h[1-6]$', tag.name)
                and searched_word in tag.text)
    
    searched_word = 'Artificial Intelligence'
    soup = bs4.BeautifulSoup(data, 'html.parser')
    results = soup.body.find_all(head_contain_word)
    

    results:

    [<h3 class="vc_custom_heading titel-left wow" data-wow-delay="0.3s">
    <span class="titel-line"></span>Artificial Intelligence                                  </h3>]
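
To see why the `string=` filter misses a heading that has a child tag, it may help to look at `.string` directly: Beautiful Soup returns None for `.string` when a tag has more than one child, and `string=` is compared against that value. The following is a small sketch on a made-up two-heading snippet, not the original page.

```python
import re
import bs4

# Two headings with the same visible text; the second has a child <span>.
html = '''
<h3>Artificial Intelligence</h3>
<h3><span class="titel-line"></span>Artificial Intelligence</h3>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')
plain, nested = soup.find_all('h3')

print(plain.string)   # the text node, since it is the only child
print(nested.string)  # None, because the tag has two children

heading = re.compile(r'^h[1-6]$')
word = 'Artificial Intelligence'

# string= is matched against tag.string, so the nested heading is skipped.
by_string = soup.find_all(heading, string=re.compile(word))

# A function filter can test the combined text of all descendants instead.
by_text = soup.find_all(
    lambda tag: heading.match(tag.name) and word in tag.get_text()
)

print(len(by_string), len(by_text))
```

This is why the function-based filter finds both headings while the `string=` version finds only the one without a child tag.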
    