Extracting only the title tag from an HTML file prints the rest of the file's HTML as well

Terry
January 7, 2024
112 views
1 vote
2 Answers

I am running VS Code v1.85.1, launched from Anaconda Navigator 2.4.0, on a 64-bit Windows 10 machine.
The Python version in VS Code is 3.9.12.
I am attempting a simple extraction of the contents of the title tag from an HTML file located in the same directory as my Python script.

When I print the matched element itself, it shows as expected. However, when I try to extract just the contents of the title tag, the script outputs everything in the HTML file from the title tag onward.

Here is the html file:

<!doctype html>
<html class="no-js" lang="">
    <head>
        <title>Test - A Sample Website</title>
        <meta charset="utf-8">
        <link rel="stylesheet" href="css/normalize.css">
        <link rel="stylesheet" href="css/main.css">
    </head>
    <body>
        <h1 id='site_title'>Test Website</h1>
        <hr></hr>
        <div class="article">
            <h2><a href="article_1.html">Article 1 Headline</a></h2>
            <p>This is a summary of article 1</p>
        </div>
        <hr></hr>
        <div class="article">
            <h2><a href="article_2.html">Article 2 Headline</a></h2>
            <p>This is a summary of article 2</p>
        </div>
        <hr></hr>
        <div id='footer'>
            <p>Footer Information</p>
        </div>
        <script>
        var para = document.createElement("p");
        var node = document.createTextNode("This is text generated by JavaScript.");
        para.appendChild(node);
        var element = document.getElementById("footer");
        element.appendChild(para);
        </script>
    </body>
</html>

And here is my Python code:

from requests_html import HTML

with open('simple.html') as html_file:
    source = html_file.read()
    html = HTML(html=source)


match = html.find('title')
print(match[0])

When I run the above I get what I expect:
PS C:UsersTerryAnaconda Scripts> c:; cd 'c:UsersTerryAnaconda Scripts'; & 'C:UsersTerryanaconda3python.exe' 'c:UsersTerry.vscodeextensionsms-python.python-2023.22.1pythonFileslibpythondebugpyadapter/../..debugpylauncher' '49312' '--' 'C:UsersTerryAnaconda ScriptsPythonscrapingrhtml-demo.py'

<Element 'title' >

PS C:UsersTerryAnaconda Scripts>

But when I try to extract the actual content of the title block using print(match[0].html), I expect to get the following output:
Test - A Sample Website

…instead, the output doesn’t stop at the closing tag but continues with the remaining contents of the file after the title tag:

<title>Test - A Sample Website</title>
<meta charset="utf-8"/>
<link rel="stylesheet" href="css/normalize.css"/>       
<link rel="stylesheet" href="css/main.css"/>
<body>

…and so on, all the way to the end of the file.

I got the code from a tutorial on YouTube, and the video comments don’t indicate that anyone else has had the same issue, which is why I am asking here.

Any help would be appreciated.

Thanks

Answers

Chosen as BEST ANSWER
- Terry
- January 7, 2024 at 12:33 am
- 0 votes
It looks as though the problem lies in my Anaconda environment under VS Code: the same Python script works fine in my Linux environment. I am closing this question now.
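
For anyone comparing two environments like this, here is a minimal sketch that prints the Python version and the installed versions of a few packages so the output can be compared side by side. The package list is an assumption about what might be relevant to requests_html parsing, not a confirmed cause of the difference; adjust it as needed.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of packages worth comparing; not taken from the question.
packages = ["requests-html", "lxml", "pyquery", "beautifulsoup4"]

print("python", sys.version.split()[0])
for name, version in installed_versions(packages).items():
    print(name, version or "not installed")
```

Running it once in the Anaconda environment and once on Linux shows whether the two setups actually use the same library versions.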


- str1ng
- January 6, 2024 at 9:10 pm
- 0 votes
The html attribute in requests_html returns the HTML representation of the element, including its tags and any children. Even when it behaves as expected, match[0].html gives the markup <title>Test - A Sample Website</title> rather than the bare text, so it is not the attribute you want here. Instead of .html, use the .text property, which returns only the element’s text content. Passing first=True to find() also returns a single element instead of a list, so no indexing is needed.
```
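from requests_html import HTML

# Read the local file and parse it into a requests_html HTML object
with open('simple.html') as html_file:
    source = html_file.read()
    html = HTML(html=source)

# first=True returns a single Element (or None) instead of a list
match = html.find('title', first=True)
if match is not None:
    print(match.text)  # text content only, without the <title> tags
else:
    print("Title tag not found.")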
```
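
As a cross-check that does not depend on requests_html at all, the title text can also be pulled out with the standard library's html.parser module. This is a different technique from the one above, shown only as a sketch; the names TitleParser and extract_title are made up for illustration, and the sample string is a shortened stand-in for simple.html.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Start collecting only for the first <title> encountered
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        # Stop collecting at the matching </title>
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(source):
    """Return the <title> text of an HTML string, or '' if there is none."""
    parser = TitleParser()
    parser.feed(source)
    parser.close()
    return parser.title


# Shortened stand-in for the contents of simple.html
sample = (
    "<html><head><title>Test - A Sample Website</title>"
    "<meta charset='utf-8'></head>"
    "<body><h1>Test Website</h1></body></html>"
)
print(extract_title(sample))
```

If this prints the title correctly while the requests_html version does not, that points at the installed parsing libraries rather than at the HTML file.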