
I have an arbitrary HTML file. It has many tags with children, those children have children of their own, and the nesting depth is unknown.

An example file is:

<!DOCTYPE html>
<html lang="en">
  <style>

    h1 {
      font-size: 26px;
      text-transform: uppercase;
      font-style: italic;
    }


  </style>
  <head>
    <meta charset="UTF-8" />
    <title>The Basic Language of the Web: HTML</title>
  </head>
  <style>
  </style>
  <body>
    <!--
    <h1>The Basic Language of the Web: HTML</h1>
    <h2>The Basic Language of the Web: HTML</h2>
    <h3>The Basic Language of the Web: HTML</h3>
    <h4>The Basic Language of the Web: HTML</h4>
    <h5>The Basic Language of the Web: HTML</h5>
    <h6>The Basic Language of the Web: HTML</h6>
    -->

    <header>
      <h1>Code Magazine</h1>

      <nav>
        <a href="blog.html">Blog</a>
        <a href="blog.html">Blog</a>
        <a href="#">Challenges</a>
        <a href="#">Flexbox</a>
        <a href="#">CSS Grid</a>
      </nav>
    </header>

    <article>
      <header>
        <title>other title</title>
        <h2 style="text-transform: uppercase;
      font-style: italic;">The Basic Language of the Web: HTML</h2>

        <img
          src="img/laura-jones.jpg"
          alt="Headshot of Laura Jones"
          height="50"
          width="50"
        />

        <p>Posted by <strong>Laura Jones</strong> on Monday, June 21st 2027</p>

        <img
          src="img/post-img.jpg"
          alt="HTML code on a screen"
          width="500"
          height="200"
        />
      </header>

      <p>
        All modern websites and web applications are built using three
        <em>fundamental</em>
        technologies: HTML, CSS and JavaScript. These are the languages of the
        web.
      </p>

      <p>
        In this post, let's focus on HTML. We will learn what HTML is all about,
        and why you too should learn it.
      </p>

      <h3>What is HTML?</h3>
      <title> HTML</title>
      <p>
        HTML stands for <strong>H</strong>yper<strong>T</strong>ext
        <strong>M</strong>arkup <strong>L</strong>anguage. It's a markup
        language that web developers use to structure and describe the content
        of a webpage (not a programming language).
      </p>
      <p>
        HTML consists of elements that describe different types of content:
        paragraphs, links, headings, images, video, etc. Web browsers understand
        HTML and render HTML code as websites.
      </p>
      <p>In HTML, each element is made up of 3 parts:</p>

      <ol>
        <li>The opening tag</li>
        <li>The closing tag</li>
        <li>The actual element</li>
      </ol>
      <a href="#">Flexbox</a>
      <a href="#">CSS Grid</a>
      <p>
        You can learn more at
        <a
          href="https://developer.mozilla.org/en-US/docs/Web/HTML"
          target="_blank"
          >MDN Web Docs</a
        >.
      </p>

      <h3>Why should you learn HTML?</h3>

      <p>
        There are countless reasons for learning the fundamental language of the
        web. Here are 5 of them:
      </p>

      <ol>
        <li>To be able to use the fundamental web dev language</li>
        <li>
          To hand-craft beautiful websites instead of relying on tools like
          Worpress or Wix
        </li>
        <li>To build web applications</li>
        <li>To impress friends</li>
        <li>To have fun 😃</li>
      </ol>

      <p>Hopefully you learned something new here. See you next time!</p>
    </article>

    
  </body>
</html>

From that file, I want to count how many times each tag appears. (I also need to know on which lines each tag appears, but I don't know how to get that, which is why it isn't in my code.)

#expected output

number_of_appearances = { 
 'style':2,
 'head': 1,
 'meta':1,
 'title':3,
 'body':1,
 'header':2,
 'h1':1, 
 'nav':1,
 'a':7,
 'article': 1,
 'h2': 1,
 'img': 2,
 'p': 9,
  'ol': 2,
  'li': 8,
  'h3':2
  
  }

line_in_which_appeared={
 'style':[3,17],
 'body': [19],
 'meta':[14],
 'head':[13]

 #.... and so on
}

My code is the following:

import os
import xml.etree.ElementTree as et

def function_parser():

   #os.chdir("/Users/user/pythonProject/")

   #with open("index.html", "r") as fil:
   #   content = fil.read()  


   tree = et.parse('index.html')
   root = tree.getroot()
   l_elements={}
   for child in root:
      if child.tag not in l_elements:
         l_elements[child.tag] = 1
      else:
         l_elements[child.tag] += 1
      for ch in child:
         if ch.tag not in l_elements:
            l_elements[ch.tag] = 1
         else:
            l_elements[ch.tag] += 1
      for ch2 in child:
         if ch2.tag not in l_elements:
            l_elements[ch2.tag] = 1
         else:
            l_elements[ch2.tag] += 1

       # an so on..... I don't know how many nested loops I need because the deep of 
       # the tags can change 

   print('number_of_appearances= ', l_elements)

But I have some problems with that solution:

  1. I want to choose the path of the HTML file, with something like os.chdir("/Users/user/pythonProject/") followed by with open("index.html", "r") as fil: content = fil.read(), but if I do that, I don't know how to get the tags easily.

  2. I don't know how deeply the tags will be nested (a tag such as head might have depth 1, while header can have depth 3, and in other files the depth can be greater), so I don't know how many nested for loops I need. Is there a way to do it recursively?

  3. With this solution I cannot get the line number on which I found each instance of a tag. I need to say that a tag appears on lines x, y and w. For example, in this file, the style tag appears on lines 3 and 17.

Note: A solution that uses only the Python standard library (no packages that have to be installed) is preferred.

2 Answers


  1. What you are trying to do is an excellent opportunity to use a simple recursive function: the function keeps calling itself with values produced by its own processing until a stopping condition is reached.

    In this case, you are trying to get all HTML tags of an ElementTree. You already know how to get the tag of an element and how to get its children. So how do we do what you intended with a recursive function?

    Simple: get the children of the current node and call the function itself on each child. If there are no children, no further recursive calls happen; the function has reached the end of a call chain and returns its result, which propagates back up the chain.

    Iterating over an element directly (for e in node) yields its direct children. Note that node.iter() is different: it yields the node itself followed by all of its descendants at every depth, so using it inside a recursive function would visit, and count, nested elements more than once.

    Code:

    import json
    from xml.etree import ElementTree
    
    def tree_html(file):
        tags = {}  # tag name -> number of occurrences
        def walk_html(node):
            result = {}
            # Iterating over an element yields only its direct children.
            for e in node:
                result |= walk_html(e)  # dict merge, requires Python 3.9+
            tags[node.tag] = tags.setdefault(node.tag, 0) + 1
            return {node.tag: result}
        html = ElementTree.parse(file)
        tree = walk_html(html.getroot())
        return {'tags_count': tags, 'tags_tree': tree}
    
    print(json.dumps(tree_html('D:/test1.html'), indent=4, ensure_ascii=False))
    

    Output:

    {
        "tags_count": {
            "style": 2,
            "meta": 1,
            "title": 3,
            "head": 1,
            "h1": 1,
            "a": 8,
            "nav": 1,
            "header": 2,
            "h2": 1,
            "img": 2,
            "strong": 5,
            "p": 9,
            "em": 1,
            "h3": 2,
            "li": 8,
            "ol": 2,
            "article": 1,
            "body": 1,
            "html": 1
        },
        "tags_tree": {
            "html": {
                "style": {},
                "head": {
                    "meta": {},
                    "title": {}
                },
                "body": {
                    "header": {
                        "h1": {},
                        "nav": {
                            "a": {}
                        }
                    },
                    "article": {
                        "header": {
                            "title": {},
                            "h2": {},
                            "img": {},
                            "p": {
                                "strong": {}
                            }
                        },
                        "p": {},
                        "h3": {},
                        "title": {},
                        "ol": {
                            "li": {}
                        },
                        "a": {}
                    }
                }
            }
        }
    }
    

    You can use dict.setdefault to supply a default value: d.setdefault(k, 0) stores 0 under k if the key isn't present, and in either case returns the value now stored under k.

    d[k] = d.setdefault(k, 0) + 1
    

    Is equivalent to:

    if k in d:
        d[k] += 1
    else:
        d[k] = 1
    

    Using dict.setdefault is more concise.

    But in actual code, use collections.Counter

    Usage example:

    from collections import Counter
    
    c = Counter()
    
    for _ in range(3):
        c['one'] += 1
    

    In a Counter, if a key is not found it is automatically set to 0, so you don’t need if checks.
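Because Element.iter() already walks the whole subtree, a Counter can tally every tag in a single pass, without explicit recursion. The snippet below is only a minimal sketch, assuming the markup is well-formed enough for ElementTree to parse; the sample string and the count_tags name are made up for illustration.

```python
from collections import Counter
from xml.etree import ElementTree


def count_tags(root):
    """Count how often each tag occurs in the subtree rooted at `root`."""
    # iter() yields root itself and then every descendant, each exactly once.
    return Counter(element.tag for element in root.iter())


# Hypothetical sample document, not the file from the question.
sample = "<html><body><p>one</p><p>two <b>bold</b></p></body></html>"
counts = count_tags(ElementTree.fromstring(sample))
print(counts["p"], counts["b"], counts["html"])
```

For a file on disk, ElementTree.parse accepts a full path (for example ElementTree.parse("/Users/user/pythonProject/index.html").getroot()), so no os.chdir call should be needed.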


    As for the line numbers: no, I don't think it is possible or practical to get the line number where a tag appeared. HTML is not a regular language, and it can be minified so that everything sits on one line and still works. Real-world HTML isn't always as human-readable as your example; in practice most of it is crammed together. The line number of a tag has no specific meaning: there is no rule that tag x must be on line y.

    And if you really want, you can try using regexes to process the raw HTML text and get the line number of tags.

    The simplest tags in your example file start with a less-than sign, followed by an alphanumeric name, and end with a greater-than sign. The following regex matches a line consisting of just such a tag: r'^<\w+>$'.

    To find such a tag inside a longer string, use re.search(r'<\w+>', some_text).
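As a rough illustration of that idea, the sketch below scans the raw text line by line and records where something that looks like an opening tag starts. It is only an approximation under stated assumptions: the pattern and the tag_lines_regex name are hypothetical, and the scan knows nothing about comments, scripts, or attribute values, so it will also report tags that are commented out.

```python
import re
from collections import defaultdict

# Assumption: an opening tag is "<" directly followed by a letter and name characters.
TAG_START = re.compile(r"<([A-Za-z][\w-]*)")


def tag_lines_regex(text):
    """Map each tag name to the 1-based line numbers where it opens."""
    found = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TAG_START.finditer(line):
            found[match.group(1).lower()].append(lineno)
    return dict(found)


# Hypothetical sample input.
result = tag_lines_regex("<ul>\n  <li>a</li>\n  <li class='x'>b</li>\n</ul>")
print(result)
```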

    But not all tags are this nice.

    I have found all the following tags on this very page (press F12)

    <!DOCTYPE html>
    <html class="html__responsive " lang="en">
    <head>
    <body class="ask-page unified-theme theme-dark">
    <div id="notify-container"></div>
    <div id="custom-header"></div>
    <header class="s-topbar ps-fixed t0 l0 js-top-bar">
    <div class="s-topbar--container">
    <title>Edit - Stack Overflow</title>
    <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196">
    <link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a">
    <link rel="image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a"> 
    <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
    <meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">
    <meta property="og:type" content= "website" />
    <meta property="og:url" content="https://stackoverflow.com/posts/76744699/edit"/>
    <meta property="og:site_name" content="Stack Overflow" />
    <meta property="og:image" itemprop="image primaryImageOfPage" content="https://cdn.sstatic.net/Sites/stackoverflow/Img/[email protected]?v=73d79a89bded" />
    <meta name="twitter:card" content="summary"/>
    <meta name="twitter:domain" content="stackoverflow.com"/>
    <meta name="twitter:title" property="og:title" itemprop="name" content="Edit" />
    <meta name="twitter:description" property="og:description" itemprop="description" content="Stack Overflow | The World&#x2019;s Largest Online Community for Developers" />
    <script id="webpack-public-path" type="text/uri-list">https://cdn.sstatic.net/</script>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
    <script defer src="https://cdn.sstatic.net/Js/third-party/npm/@stackoverflow/stacks/dist/js/stacks.min.js?v=ad920dba3340"></script>
    <script src="https://cdn.sstatic.net/Js/stub.en.js?v=34b15b21ff80"></script>
    <link rel="stylesheet" type="text/css" href="https://cdn.sstatic.net/Shared/stacks.css?v=73d389de9f03">
    <link rel="stylesheet" type="text/css" href="https://cdn.sstatic.net/Sites/stackoverflow/primary.css?v=7e40c664dcde">
    <link rel="stylesheet" type="text/css" href="https://cdn.sstatic.net/Sites/stackoverflow/secondary.css?v=6793be5668e3">
    

    How can you write a regex that matches all these tags? I can’t.

    You shouldn’t parse HTML with regex.


    If you want the tree to preserve every tag, including repeated siblings, you can't use dicts: dictionary keys have to be unique (a dict is implemented as a hash table), so a repeated tag overwrites the earlier entry.

    I have reworked my function and used nested list structure to accurately represent the tree, but the result is ugly and hard to decipher:

    from xml.etree import ElementTree
    from collections import Counter
    
    def tree_html(file):
        tags = Counter()
        html = ElementTree.parse(file)
        tree = walk_html(html.getroot(), tags)
        return {'tags_count': tags, 'tags_tree': tree}
    
    def walk_html(node, tags):
        # Iterate over direct children only; node.iter() would also yield
        # every deeper descendant and inflate the counts.
        result = [walk_html(e, tags) for e in node]
        tags[node.tag] += 1
        # A leaf is [tag]; a node with children is [tag, [children...]].
        return [node.tag] + [result] * bool(result)
    

    Output from your example (line breaks added for readability):

    {'tags_count': Counter({'p': 9,
              'a': 8,
              'li': 8,
              'strong': 5,
              'title': 3,
              'style': 2,
              'header': 2,
              'img': 2,
              'h3': 2,
              'ol': 2,
              'meta': 1,
              'head': 1,
              'h1': 1,
              'nav': 1,
              'h2': 1,
              'em': 1,
              'article': 1,
              'body': 1,
              'html': 1}),
     'tags_tree': ['html',
      [['style'],
       ['head', [['meta'], ['title']]],
       ['style'],
       ['body',
        [['header',
          [['h1'], ['nav', [['a'], ['a'], ['a'], ['a'], ['a']]]]],
         ['article',
          [['header',
            [['title'], ['h2'], ['img'], ['p', [['strong']]], ['img']]],
           ['p', [['em']]],
           ['p'],
           ['h3'],
           ['title'],
           ['p', [['strong'], ['strong'], ['strong'], ['strong']]],
           ['p'],
           ['p'],
           ['ol', [['li'], ['li'], ['li']]],
           ['a'],
           ['a'],
           ['p', [['a']]],
           ['h3'],
           ['p'],
           ['ol', [['li'], ['li'], ['li'], ['li'], ['li']]],
           ['p']]]]]]]}
    
  2. You can use Beautiful Soup (a third-party package, imported as bs4), which has this feature: it records the line on which each tag appeared, and even the position within that line:

    from bs4 import BeautifulSoup
    
    with open("your_file.html", "r") as f_in:
        soup = BeautifulSoup(f_in, "html.parser")
    
    out = {}
    for tag in soup.find_all():
        out.setdefault(tag.name, []).append(tag.sourceline)
    
    print(out)
    

    Prints:

    {
        "html": [2],
        "style": [3, 17],
        "head": [13],
        "meta": [14],
        "title": [15, 43, 77],
        "body": [19],
        "header": [29, 42],
        "h1": [30],
        "nav": [32],
        "a": [33, 34, 35, 36, 37, 96, 97, 100],
        "article": [41],
        "h2": [44],
        "img": [47, 56],
        "p": [54, 64, 71, 78, 84, 89, 98, 109, 125],
        "strong": [54, 79, 79, 80, 80],
        "em": [66],
        "h3": [76, 107],
        "ol": [91, 114],
        "li": [92, 93, 94, 115, 116, 120, 121, 122],
    }
    

    If you want counts:

    count = {k: len(v) for k, v in out.items()}
    print(count)
    

    Prints:

    {
        "html": 1,
        "style": 2,
        "head": 1,
        "meta": 1,
        "title": 3,
        "body": 1,
        "header": 2,
        "h1": 1,
        "nav": 1,
        "a": 8,
        "article": 1,
        "h2": 1,
        "img": 2,
        "p": 9,
        "strong": 5,
        "em": 1,
        "h3": 2,
        "ol": 2,
        "li": 8,
    }
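If installing a package is not an option, something similar may be achievable with the standard library alone: html.parser.HTMLParser exposes getpos(), which returns the current line number and offset while a tag is being handled. The following is a minimal sketch rather than a drop-in replacement; the TagLineCollector class, the tag_lines function, and the sample string are hypothetical names introduced here for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagLineCollector(HTMLParser):
    """Hypothetical helper: records the line of every opening tag."""

    def __init__(self):
        super().__init__()
        self.lines = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line number, column offset); lines are 1-based.
        line, _offset = self.getpos()
        self.lines[tag].append(line)


def tag_lines(html_text):
    """Map each tag name to the list of lines on which it opens."""
    collector = TagLineCollector()
    collector.feed(html_text)
    collector.close()
    return dict(collector.lines)


# Hypothetical sample input, not the file from the question.
sample = "<html>\n<body>\n<p>one</p>\n<p>two <b>bold</b></p>\n<img src='x.png' />\n</body>\n</html>\n"
result = tag_lines(sample)
counts = {tag: len(lines) for tag, lines in result.items()}
print(result)
print(counts)
```

Self-closing tags such as img are reported too, because the default handle_startendtag forwards to handle_starttag, and commented-out markup is skipped because comments are handled separately.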
    