
I am running VS Code v1.85.1 within Anaconda Navigator 2.4.0 on a 64-bit Windows 10 machine.
The Python version in VS Code is 3.9.12.
I am attempting a simple extraction of the contents of the title tag of an HTML file located in the same directory as my Python script.

When I print the element itself, it displays as expected. However, when I try to extract just the contents of the title tag, the script outputs far more of the HTML file than the title.

Here is the HTML file:

<!doctype html>
<html class="no-js" lang="">
    <head>
        <title>Test - A Sample Website</title>
        <meta charset="utf-8">
        <link rel="stylesheet" href="css/normalize.css">
        <link rel="stylesheet" href="css/main.css">
    </head>
    <body>
        <h1 id='site_title'>Test Website</h1>
        <hr></hr>
        <div class="article">
            <h2><a href="article_1.html">Article 1 Headline</a></h2>
            <p>This is a summary of article 1</p>
        </div>
        <hr></hr>
        <div class="article">
            <h2><a href="article_2.html">Article 2 Headline</a></h2>
            <p>This is a summary of article 2</p>
        </div>
        <hr></hr>
        <div id='footer'>
            <p>Footer Information</p>
        </div>
        <script>
        var para = document.createElement("p");
        var node = document.createTextNode("This is text generated by JavaScript.");
        para.appendChild(node);
        var element = document.getElementById("footer");
        element.appendChild(para);
        </script>
    </body>
</html>

And here is my Python code:

from requests_html import HTML

with open('simple.html') as html_file:
    source = html_file.read()
    html = HTML(html=source)


match = html.find('title')
print(match[0])

When I run the above, I get what I expect:
PS C:UsersTerryAnaconda Scripts> c:; cd ‘c:UsersTerryAnaconda Scripts’; & ‘C:UsersTerryanaconda3python.exe’ ‘c:UsersTerry.vscodeextensionsms-python.python-2023.22.1pythonFileslibpythondebugpyadapter/../..debugpylauncher’ ‘49312’ ‘–‘ ‘C:UsersTerryAnaconda ScriptsPythonscrapingrhtml-demo.py’

<Element 'title' >

PS C:UsersTerryAnaconda Scripts>

But when I try to extract the actual content of the title block using print(match[0].html), I expect to get the following output:
Test - A Sample Website

Instead, the output doesn't stop at the closing tag; it prints the rest of the file, starting from the title tag:

<title>Test - A Sample Website</title>
<meta charset="utf-8"/>
<link rel="stylesheet" href="css/normalize.css"/>       
<link rel="stylesheet" href="css/main.css"/>
<body>

...and so on, all the way to the end of the file.

I got the code from a tutorial on YouTube, and the video comments don't indicate that anyone else has had the same issue, which is why I am asking here.

Any help would be appreciated.

Thanks

2 Answers


  1. Chosen as BEST ANSWER

    It looks as though the problem lies in my Anaconda environment under VS Code. The Python script itself works fine in my Linux environment, so I am closing this question now.
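
    When the same script behaves differently in two environments, one way to narrow it down is to compare the installed package versions on both machines. The sketch below assumes that a version difference in requests-html or its dependencies lxml and pyquery is involved, which nothing in this thread confirms. The helper name installed_versions is hypothetical.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Packages assumed to be relevant here; adjust the list as needed.
PACKAGES = ("requests-html", "lxml", "pyquery")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print the interpreter version first, then one line per package.
    print("python", sys.version.split()[0])
    for name, ver in installed_versions().items():
        print(name, ver or "not installed")
```

    Running this in both the Anaconda and the Linux environment and comparing the two printouts would show whether the library versions differ at all.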


  2. In requests_html, the html attribute returns the HTML representation of an element, including its children. Because the title tag is part of the <head> element, calling match[0].html returns more than just the <title>. To extract only the text content of the <title> tag, use the .text property instead of .html.

    from requests_html import HTML
    
    with open('simple.html') as html_file:
        source = html_file.read()
        html = HTML(html=source)
    
    match = html.find('title', first=True)  
    if match:
        print(match.text)
    else:
        print("Title tag not found.")
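
    If the environment-specific behaviour persists, the title text can also be read without requests_html. The following is a minimal sketch using only the standard library's html.parser module. It is a swapped-in technique, not the tutorial's approach, and the names TitleParser and extract_title are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Keep only text that appears while inside the <title> element.
        if self._in_title:
            self.title_parts.append(data)


def extract_title(source):
    """Return the stripped title text, or None if no title text is found."""
    parser = TitleParser()
    parser.feed(source)
    parser.close()
    text = "".join(parser.title_parts).strip()
    return text or None


# Inline sample so the sketch runs without simple.html being present.
sample = (
    "<html><head><title>Test - A Sample Website</title></head>"
    "<body><h1>Test Website</h1></body></html>"
)
print(extract_title(sample))  # prints: Test - A Sample Website
```

    To use it on the file from the question, read simple.html into a string as before and pass that string to extract_title.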
    