I'm trying to write code to fetch and clean the text from 100 websites per day. I came across an issue with one website that has more than one h1 tag, and when you scroll to the next h1 tag the URL on the page changes (for example, this website).
What I have is basically this:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://economictimes.indiatimes.com/news/international/business/volkswagen-sets-5-7-revenue-growth-target-preaches-cost-discipline/articleshow/101168014.cms'
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"})
soup = BeautifulSoup(response.content, 'html.parser')
if len(soup.body.find_all('h1')) > 1:  # check if there is more than one h1 tag
    if url.endswith(".cms"):  # check if the URL has a .cms ending (I have my doubts about this part)
        for elem in soup.next_siblings:
            if elem.name == 'h1':
                # GET THE TEXT SOMEHOW
                break
```
How can I get the text after the first h1 tag? (Please note that the text is in a `<…>` tag and not in a `<…>` tag.)
2 Answers
Maybe the BeautifulSoup parser would be helpful here; see this guide: https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/
`soup.body.find_all('h1')` will find all the `<h1>` elements within the `<body>`. We then iterate through their next siblings until we find a `<p>` tag (assuming that the `<p>` tag holds the text). `get_text()` grabs the text under the `<p>` tag, and `strip=True` removes any leading or trailing whitespace.
I had a similar issue once. Hope this helps!
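A sketch of that approach, using a small inline HTML stand-in (for the real site you would first fetch the page with `requests.get(...)`; the variable names and the sample markup are my own):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the fetched page.
html = """
<body>
  <h1>First headline</h1>
  <p>Intro paragraph.</p>
  <p>Second paragraph.</p>
  <h1>Next headline</h1>
  <p>Unrelated text.</p>
</body>
"""
soup = BeautifulSoup(html, 'html.parser')

first_h1 = soup.body.find('h1')
texts = []
for sib in first_h1.next_siblings:
    if sib.name == 'h1':      # stop when the next headline is reached
        break
    if sib.name == 'p':       # assume the article text sits in <p> tags
        texts.append(sib.get_text(strip=True))  # strip leading/trailing whitespace

print(texts)
```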
You had the right idea in trying to use `.next_siblings`, but keep in mind that `soup.next_siblings` is unlikely to generate anything, as the document itself is generally not expected to have any siblings. Instead: find the first header and then, if it doesn't have any siblings, search up through its parents to find the nearest one that does have siblings; then go through those siblings, stopping if another `h1` tag is reached. This works for the site in your example.
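A sketch of that generic approach, using a small inline HTML stand-in (the `h1Sibs_text` name follows the description here; the sample markup and helper logic are illustrative):

```python
from bs4 import BeautifulSoup

# Simplified stand-in: the h1 is wrapped in a div with no siblings of its own.
html = """
<div><article>
  <div><h1>Headline</h1></div>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <h1>Another headline</h1>
  <p>Not part of the first article.</p>
</article></div>
"""
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('h1')                      # the first header
while tag is not None and tag.find_next_sibling() is None:
    tag = tag.parent                       # climb until a tag with siblings

h1Sibs = []
if tag is not None:
    for sib in tag.next_siblings:
        if sib.name is None:
            continue                       # skip whitespace-only nodes
        if sib.name == 'h1' or sib.find('h1') is not None:
            break                          # stop if another h1 is reached
        h1Sibs.append(sib.get_text(strip=True))

h1Sibs_text = '\n---\n'.join(h1Sibs)
print(h1Sibs_text)
```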
Printing the joined siblings' text, e.g. `print(h1Sibs_text)`, should give you the article text. Note that you don't have to use `'\n---\n'` to join the siblings' text; you can use any string as separator.

Btw, for that specific site's articles, a much simpler way would be to target the header tag specifically by its class (`artTitle`).
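For example, with the `*:has(>h1.artTitle)~*` selector (this assumes the site still uses the `artTitle` class on its headline; the surrounding markup here is a simplified stand-in):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the article page's structure.
html = """
<div class="article">
  <div><h1 class="artTitle">Headline</h1></div>
  <div class="artText">Body text of the article.</div>
  <div class="related">Related stories</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Select every sibling that follows the element directly containing h1.artTitle.
sibs = soup.select('*:has(>h1.artTitle)~*')
print('\n'.join(s.get_text(strip=True) for s in sibs))
```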
NOTE: using `select` with the `*:has(>h1.artTitle)~*` selector is similar to using `soup.find('h1', class_='artTitle').parent.next_siblings`, but it is safer than chaining `find`, `parent`, and `next_siblings`, as it will simply return an empty list instead of raising an error if `h1.artTitle` is not found.

If you are scraping many different links, but you know the sites for most of them, you might want to break it up into `if...elif...` blocks for each site (or even groups of sites) and only use something generic like the sibling-walking approach above for unlisted sites in the `else` block. You might even consider using something like a configurable parser with a set of selectors for each site.
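A sketch of that kind of per-site dispatch (the site names, selectors, and fallback here are hypothetical):

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def extract_article_text(url, html):
    """Pick an extraction rule per site; selectors are illustrative guesses."""
    soup = BeautifulSoup(html, 'html.parser')
    host = urlparse(url).netloc
    if host == 'economictimes.indiatimes.com':
        parts = soup.select('*:has(>h1.artTitle)~*')   # site-specific rule
    elif host == 'example-news.com':                   # hypothetical site
        parts = soup.select('div.story-body p')
    else:
        parts = soup.find_all('p')                     # generic fallback
    return '\n'.join(p.get_text(strip=True) for p in parts)

print(extract_article_text(
    'https://unknown-site.example/post',
    '<body><h1>T</h1><p>Hello.</p><p>World.</p></body>'))
```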