How to get a few characters after a string to identify whether the string is in a heading tag or a list item? - Artificial Intelligence

9113303
November 29, 2018
139 views
0 votes
2 Answers

I have collected all the heading tags in the data with
heads=str(soup.find_all(re.compile('^h[1-6]$'))). Then I collect the data between the heading tags. A portion of the source code is given below.

import bs4
import re

data = '''
<html>
<body>
<div class="mob-icon"> <span></span></div>
<nav id="nav">
<ul class="" id="menu-home-welcome-banner">
<li class="menu-item menu-item-type-custom menu-item-object-custom current-menu-parent menu-item-has-children menu-item-1778" id="menu-item-1778"> <a class="submeny-top" href="http://www.uvionicstech.com" ontouchstart="">Home</a> </li>
<!--<li id="menu-item-1785" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1785"><a href="#about" class="scroll-to-link" ontouchstart="">About</a></li>-->
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1786" id="menu-item-1786"><a class="scroll-to-link" href="#data-analytics" ontouchstart="">PRODUCTS &amp; SOLUTIONS</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1787" id="menu-item-1787"><a class="scroll-to-link" href="#artificial-intelligence" ontouchstart="">Artificial Intelligence</a></li>
<!-- <li id="menu-item-1788" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1788"><a href="#iot" class="scroll-to-link" ontouchstart="">IOT</a></li> -->
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1788" id="menu-item-1788"><a class="scroll-to-link" href="#services" ontouchstart="">All in One Place</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1789" id="menu-item-1789"><a class="scroll-to-link" href="#eco-system" ontouchstart="">PARTNERS</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1791" id="menu-item-1791"><a class="scroll-to-link" href="#contact" ontouchstart="">Contact</a></li>
<h3 class="h3 text-center">PARTNERS</h3>
<h3 class="vc_custom_heading titel-left wow" data-wow-delay="0.3s">
<span class="titel-line"></span>Artificial Intelligence                                  </h3>

<h3 class="vc_custom_heading titel-left wow " data-wow-delay="0.3s"><span class="titel-line">
</span>Everything for your Business, <small>all in one place</small>
</h3>

</ul>
</nav>
</div>

</body>
</html>
'''

searched_word = 'Artificial Intelligence'
soup = bs4.BeautifulSoup(data, 'html.parser')
results = soup.body.find_all(string=re.compile('.*{0}.*'.format(searched_word)), recursive=True)

output:

 results
['Artificial Intelligence',
 'Artificial Intelligence                                  ']

Here the first Artificial Intelligence comes from a list item and the second from a heading tag. I want to find only the occurrence that is inside a heading tag. How can I do that? Is there a way to get the next few characters after the words Artificial Intelligence, so that I would get Artificial Intelligence </h3> and could then ignore the list item?

Tags: beautifulsoup python-3.x

Answers

- chitown88
- November 29, 2018 at 1:13 pm
- 0 votes
0
Since it’s only the heading tags you want, could we just grab those and then search through them?
```
searched_word = 'Artificial Intelligence'

soup = bs4.BeautifulSoup(data, 'html.parser')
head_tags = soup.find_all('h3')


results = []
for ele in head_tags:
    if searched_word in ele.text:
        # drop the newline that follows the opening <h3> tag
        results.append(ele.text.replace('\n', ''))
if results:
    print(results)
else:
    print('No matches found')
```
gave output:
```
In [184]: results
Out[184]: ['Artificial Intelligence                                  ']
```
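A small variation on the same idea, in case the headings are not all h3: this is only a sketch that scans h1 through h6 and trims the surrounding whitespace with get_text(strip=True) instead of replacing newlines. The function name find_in_headings and the shortened sample HTML are made up for illustration and are not part of the original question.

```python
import re
import bs4

def find_in_headings(html, searched_word):
    """Return the stripped text of every h1-h6 tag that contains searched_word."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    matches = []
    for heading in soup.find_all(re.compile(r'^h[1-6]$')):
        # get_text(strip=True) strips each text piece and skips empty ones
        text = heading.get_text(strip=True)
        if searched_word in text:
            matches.append(text)
    return matches

# Shortened, hypothetical version of the HTML from the question
sample = '''
<ul>
<li><a href="#artificial-intelligence">Artificial Intelligence</a></li>
<h3 class="h3 text-center">PARTNERS</h3>
<h3 class="vc_custom_heading">
<span class="titel-line"></span>Artificial Intelligence      </h3>
</ul>
'''
heading_matches = find_in_headings(sample, 'Artificial Intelligence')
print(heading_matches)
```

The list item is never looked at because only heading tags are collected in the first place.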

If there are no child tags in the headings, like

<h3 class="vc_custom_heading">Artificial Intelligence</h3>

you can combine your regex

results = soup.body.find_all(re.compile('^h[1-6]$'), 
                             string=re.compile(searched_word))

But your h3 contains a child tag, so its .string is None and the string= filter will not match it. I would either loop as in the first answer or pass a custom function to find_all():

def head_contain_word(tag):
    # parentheses let the condition continue on the next line
    return (re.match(r'^h[1-6]$', tag.name)
            and searched_word in tag.text)

searched_word = 'Artificial Intelligence'
soup = bs4.BeautifulSoup(data, 'html.parser')
results = soup.body.find_all(head_contain_word)

results:

[<h3 class="vc_custom_heading titel-left wow" data-wow-delay="0.3s">
<span class="titel-line"></span>Artificial Intelligence                                  </h3>]
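If you would rather keep the string search from the question, another option is to collect the matching strings first and then ask each one whether it sits inside a heading, using find_parent(). The following is a rough sketch under that assumption; the names HEADING and strings_in_headings and the shortened sample HTML are hypothetical.

```python
import re
import bs4

HEADING = re.compile(r'^h[1-6]$')

def strings_in_headings(html, searched_word):
    """Return the matching strings that have an h1-h6 tag among their ancestors."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    # re.escape keeps the searched word from being read as a regex pattern
    hits = soup.find_all(string=re.compile(re.escape(searched_word)))
    # find_parent returns None when no ancestor tag name matches HEADING
    return [s for s in hits if s.find_parent(HEADING) is not None]

# Shortened, hypothetical version of the HTML from the question
sample = '''
<ul>
<li><a href="#artificial-intelligence">Artificial Intelligence</a></li>
<h3 class="vc_custom_heading">
<span class="titel-line"></span>Artificial Intelligence      </h3>
</ul>
'''
in_headings = strings_in_headings(sample, 'Artificial Intelligence')
stripped = [str(s).strip() for s in in_headings]
parent_names = [s.find_parent(HEADING).name for s in in_headings]
print(stripped, parent_names)
```

This way there is no need to read the characters that follow the word: the parent lookup tells you directly whether the string belongs to a heading or to a list item.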
