I have an arbitrary HTML file. It has many tags that have children, those children have children of their own, and the nesting depth is unknown.
An example file is:
<!DOCTYPE html>
<html lang="en">
<style>
h1 {
font-size: 26px;
text-transform: uppercase;
font-style: italic;
}
</style>
<head>
<meta charset="UTF-8" />
<title>The Basic Language of the Web: HTML</title>
</head>
<style>
</style>
<body>
<!--
<h1>The Basic Language of the Web: HTML</h1>
<h2>The Basic Language of the Web: HTML</h2>
<h3>The Basic Language of the Web: HTML</h3>
<h4>The Basic Language of the Web: HTML</h4>
<h5>The Basic Language of the Web: HTML</h5>
<h6>The Basic Language of the Web: HTML</h6>
-->
<header>
<h1>Code Magazine</h1>
<nav>
<a href="blog.html">Blog</a>
<a href="blog.html">Blog</a>
<a href="#">Challenges</a>
<a href="#">Flexbox</a>
<a href="#">CSS Grid</a>
</nav>
</header>
<article>
<header>
<title>other title</title>
<h2 style="text-transform: uppercase;
font-style: italic;">The Basic Language of the Web: HTML</h2>
<img
src="img/laura-jones.jpg"
alt="Headshot of Laura Jones"
height="50"
width="50"
/>
<p>Posted by <strong>Laura Jones</strong> on Monday, June 21st 2027</p>
<img
src="img/post-img.jpg"
alt="HTML code on a screen"
width="500"
height="200"
/>
</header>
<p>
All modern websites and web applications are built using three
<em>fundamental</em>
technologies: HTML, CSS and JavaScript. These are the languages of the
web.
</p>
<p>
In this post, let's focus on HTML. We will learn what HTML is all about,
and why you too should learn it.
</p>
<h3>What is HTML?</h3>
<title> HTML</title>
<p>
HTML stands for <strong>H</strong>yper<strong>T</strong>ext
<strong>M</strong>arkup <strong>L</strong>anguage. It's a markup
language that web developers use to structure and describe the content
of a webpage (not a programming language).
</p>
<p>
HTML consists of elements that describe different types of content:
paragraphs, links, headings, images, video, etc. Web browsers understand
HTML and render HTML code as websites.
</p>
<p>In HTML, each element is made up of 3 parts:</p>
<ol>
<li>The opening tag</li>
<li>The closing tag</li>
<li>The actual element</li>
</ol>
<a href="#">Flexbox</a>
<a href="#">CSS Grid</a>
<p>
You can learn more at
<a
href="https://developer.mozilla.org/en-US/docs/Web/HTML"
target="_blank"
>MDN Web Docs</a
>.
</p>
<h3>Why should you learn HTML?</h3>
<p>
There are countless reasons for learning the fundamental language of the
web. Here are 5 of them:
</p>
<ol>
<li>To be able to use the fundamental web dev language</li>
<li>
To hand-craft beautiful websites instead of relying on tools like
Worpress or Wix
</li>
<li>To build web applications</li>
<li>To impress friends</li>
<li>To have fun 😃</li>
</ol>
<p>Hopefully you learned something new here. See you next time!</p>
</article>
</body>
</html>
From that file, I want to count how many times each tag appears. (I also need to know on which lines each tag appears, but I don't know how to get that, which is why it is missing from my code.)
# expected output
number_of_appearances = {
    'style': 2,
    'head': 1,
    'meta': 1,
    'title': 3,
    'body': 1,
    'header': 2,
    'h1': 1,
    'nav': 1,
    'a': 7,
    'article': 1,
    'h2': 1,
    'img': 2,
    'p': 9,
    'ol': 2,
    'li': 8,
    'h3': 2
}
line_in_which_appeared = {
    'style': [3, 17],
    'body': [19],
    'meta': [14],
    'head': [13],
    # .... and so on
}
My code is the following:

import os
import xml.etree.ElementTree as et

def function_parser():
    # os.chdir("/Users/user/pythonProject/")
    # with open("index.html", "r") as fil:
    #     content = fil.read()
    tree = et.parse('index.html')
    root = tree.getroot()
    l_elements = {}
    for child in root:
        if child.tag not in l_elements:
            l_elements[child.tag] = 1
        else:
            l_elements[child.tag] += 1
        for ch in child:
            if ch.tag not in l_elements:
                l_elements[ch.tag] = 1
            else:
                l_elements[ch.tag] += 1
            for ch2 in ch:
                if ch2.tag not in l_elements:
                    l_elements[ch2.tag] = 1
                else:
                    l_elements[ch2.tag] += 1
                # and so on..... I don't know how many nested loops I need
                # because the depth of the tags can change
    print('number_of_appearances= ', l_elements)
But I have some problems with that solution:
- I want to choose the path where the HTML file is located, something like:
  os.chdir("/Users/user/pythonProject/")
  with open("index.html", "r") as fil:
      content = fil.read()
  but if I do this, I don't know how to obtain the tags easily.
- As I don't know how deeply the tags are nested (a tag such as head might have depth 1, another such as header might have depth 3, and in other files the depth can be greater), I don't know how many nested for loops I need. Is there a way to do it recursively?
- With this solution I cannot get the line number on which I found each instance of a tag. (I need to say that a tag appears on lines x, y and w. For example, in this file the tag style appears on lines 3 and 17.)
Note: A solution using only standard Python libraries (not ones you have to install) is preferred.
2 Answers
What you are trying to do is an excellent opportunity to use a simple recursive function: the function keeps calling itself with values obtained from its own processing until a condition no longer applies.
In this case, you are trying to get all the HTML tags of an ElementTree. You already know how to get the tag of an element and how to get the children of an element. So how do we do what you intended with a recursive function? Simple: get all the children of the current node and call the function itself on each child. If there are no children, no recursive calls happen; the function has reached the end of a call chain and returns its result, which propagates back up the call chain.
We can use e.iter() to walk the tree below element e. One important note: e.iter() yields e itself first and then every descendant of e (not only its direct children), so we need to use next to skip e. Iterating over e directly (for child in e) yields only its direct children.
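A minimal sketch of the recursive idea, not necessarily the code this answer originally used. The names count_tags and sample are made up for illustration, and the short sample string stands in for a well-formed index.html:

```python
import xml.etree.ElementTree as et
from collections import Counter


def count_tags(element, counts=None):
    """Recursively count the tag of `element` and of every descendant."""
    if counts is None:
        counts = Counter()
    counts[element.tag] += 1
    # Iterating over an element yields only its direct children,
    # so the recursion handles any nesting depth.
    for child in element:
        count_tags(child, counts)
    return counts


# Hypothetical stand-in for index.html (ElementTree needs well-formed XML).
sample = "<html><head><title>t</title></head><body><p>a</p><p>b <a href='#'>c</a></p></body></html>"
root = et.fromstring(sample)

counts = count_tags(root)
print(dict(counts))

# Non-recursive equivalent: iter() already visits root and all descendants.
flat_counts = Counter(el.tag for el in root.iter())
print(flat_counts == counts)
```

To read the file from a specific location, et.parse accepts a full path, for example et.parse("/Users/user/pythonProject/index.html"), so os.chdir is not needed. Keep in mind that ElementTree only parses well-formed XML, so this works only for HTML files that also happen to be valid XML.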
You can use dict.setdefault to set a default value: it returns the existing value if the key is found; otherwise it inserts the default value under that key and returns it. This is equivalent to an explicit if/else membership check like the one in your code, but more concise.
In actual code, however, use collections.Counter.
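As a rough illustration of the three options side by side (the tags list is a made-up example, not data from the question):

```python
from collections import Counter

tags = ["p", "a", "p", "li", "a", "p"]  # hypothetical sequence of tag names

# 1. Explicit membership check, as in the question's code.
explicit = {}
for tag in tags:
    if tag not in explicit:
        explicit[tag] = 1
    else:
        explicit[tag] += 1

# 2. dict.setdefault: returns the stored value, inserting 0 first if needed.
with_setdefault = {}
for tag in tags:
    with_setdefault[tag] = with_setdefault.setdefault(tag, 0) + 1

# 3. collections.Counter: counts any iterable directly.
with_counter = Counter(tags)

print(explicit == with_setdefault == dict(with_counter))  # True
print(with_counter.most_common())  # [('p', 3), ('a', 2), ('li', 1)]
```

All three produce the same counts; Counter additionally offers helpers such as most_common().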
In a Counter, a key that is not found counts as 0, so you don't need the if checks.
As for your second question: no, I don't think it is possible or practical to get the line number where a tag appeared. HTML is not a regular language, and it can be minified so that everything is on one line and still works. It isn't always as human-readable as your example; in practice most of it is crammed together. The line number of a tag has no specific meaning: there is no rule that tag x must be on line y.
And if you really want to, you can try using regexes to process the raw HTML text and get the line numbers of tags.
The simplest tags in your example file start with a less-than sign, followed by an arbitrary alphanumeric string, and end with a greater-than sign. The following regex matches such a tag when it sits alone on its line: '^<\w+>$'. To get the tags, use re.search(r'<\w+>', some_text).
But not all tags are this nice.
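A small sketch of what the regex route could look like, and of how it goes wrong. The pattern here is a looser variant of the one above (it only anchors on the opening "<name" so that tags with attributes are also seen); OPEN_TAG, tag_lines and sample are hypothetical names:

```python
import re
from collections import defaultdict

# Matches "<" followed by a tag name; closing tags ("</p>"), comments ("<!--")
# and the doctype ("<!DOCTYPE") do not match because "/" and "!" are not \w.
OPEN_TAG = re.compile(r"<(\w+)")


def tag_lines(html_text):
    """Map each tag name to the 1-based line numbers where it opens."""
    lines = defaultdict(list)
    for lineno, line in enumerate(html_text.splitlines(), start=1):
        for match in OPEN_TAG.finditer(line):
            lines[match.group(1)].append(lineno)
    return dict(lines)


sample = (
    "<html>\n"
    "<body>\n"
    "<!-- <h1>hidden</h1> -->\n"
    "<p>one</p>\n"
    '<p class="x">two</p>\n'
    "</body>\n"
    "</html>"
)
result = tag_lines(sample)
print(result)
```

Note that the h1 inside the comment on line 3 is reported as if it were a real tag, which is exactly the kind of mistake a regex makes on HTML.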
I have found far messier tags than these on this very page (press F12 to see them).
How can you write a regex that matches all these tags? I can't.
You shouldn't parse HTML with regex.
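If line numbers are needed anyway, one standard-library option that avoids regexes is html.parser.HTMLParser, whose getpos() method returns the current line number and offset. A rough sketch under that assumption (TagLocator and sample are hypothetical names, and this is only one way to do it):

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Collect, for every opening tag, its count and 1-based line numbers."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.lines = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        lineno, _offset = self.getpos()
        self.counts[tag] += 1
        self.lines[tag].append(lineno)


sample = (
    "<html>\n"
    "<body>\n"
    "<!-- <h1>hidden</h1> -->\n"
    "<p>one</p>\n"
    "<img\n"
    '  src="x.jpg" />\n'
    "</body>\n"
    "</html>"
)
locator = TagLocator()
locator.feed(sample)
locator.close()
print(dict(locator.counts))
print(dict(locator.lines))
```

Unlike a regex, the parser skips the commented-out h1, and a tag that spans several lines is reported at the line where it starts. For a file on disk, pass the text read from open(path) to feed().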
If you want to preserve the tags as they are, that is, keep duplicate tags in your tree structure, you can't use dicts, because dictionary keys have to be unique (a dictionary is implemented as a hash table). I have reworked my function to use a nested list structure that represents the tree accurately, but the result is ugly and hard to decipher.
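One possible shape for such a nested list, sketched under the assumption that each node is stored as [tag, list_of_children]; the function name to_nested and the sample string are made up for illustration:

```python
import xml.etree.ElementTree as et


def to_nested(element):
    """Return [tag, [subtree, ...]] so that duplicate sibling tags are preserved."""
    return [element.tag, [to_nested(child) for child in element]]


sample = "<body><p>a</p><p>b<a>c</a></p></body>"
tree = to_nested(et.fromstring(sample))
print(tree)  # ['body', [['p', []], ['p', [['a', []]]]]]
```

Both p elements survive as separate entries, which a dict keyed by tag name could not represent.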
You can use beautifulsoup, which has this feature: it records the line on which each tag appeared, and even the source position within that line. The same approach also gives you the counts.
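A hedged sketch of that approach, assuming the third-party beautifulsoup4 package is installed and the "html.parser" builder is used (Tag.sourceline and Tag.sourcepos are populated by that builder); the sample string is a made-up stand-in for the real file:

```python
from collections import Counter, defaultdict

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

sample = "<html>\n<body>\n<p>one</p>\n<p>two</p>\n</body>\n</html>"
soup = BeautifulSoup(sample, "html.parser")

# find_all(True) matches every tag in document order.
lines = defaultdict(list)
for tag in soup.find_all(True):
    lines[tag.name].append(tag.sourceline)  # tag.sourcepos gives the column

counts = Counter(tag.name for tag in soup.find_all(True))

print(dict(lines))
print(dict(counts))
```

This is not a standard-library solution, so it only fits if installing a package is acceptable.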