I am running VS Code v1.85.1 within Anaconda Navigator 2.4.0 on a 64-bit Windows 10 machine.
The Python version in VS Code is 3.9.12.
I am attempting a simple extraction of the contents of the title tag of an HTML file located in the same directory as my Python script.
When I print the element itself, it shows as expected. However, when I try to extract just the contents of the title tag, the script outputs the entire contents of the HTML file.
Here is the HTML file:
<!doctype html>
<html class="no-js" lang="">
<head>
<title>Test - A Sample Website</title>
<meta charset="utf-8">
<link rel="stylesheet" href="css/normalize.css">
<link rel="stylesheet" href="css/main.css">
</head>
<body>
<h1 id='site_title'>Test Website</h1>
<hr></hr>
<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>
<hr></hr>
<div class="article">
<h2><a href="article_2.html">Article 2 Headline</a></h2>
<p>This is a summary of article 2</p>
</div>
<hr></hr>
<div id='footer'>
<p>Footer Information</p>
</div>
<script>
var para = document.createElement("p");
var node = document.createTextNode("This is text generated by JavaScript.");
para.appendChild(node);
var element = document.getElementById("footer");
element.appendChild(para);
</script>
</body>
</html>
And here is my Python code:
from requests_html import HTML

with open('simple.html') as html_file:
    source = html_file.read()
    html = HTML(html=source)

match = html.find('title')
print(match[0])
When I run the above, I get what I expect:
PS C:UsersTerryAnaconda Scripts> c:; cd 'c:UsersTerryAnaconda Scripts'; & 'C:UsersTerryanaconda3python.exe' 'c:UsersTerry.vscodeextensionsms-python.python-2023.22.1pythonFileslibpythondebugpyadapter/../..debugpylauncher' '49312' '--' 'C:UsersTerryAnaconda ScriptsPythonscrapingrhtml-demo.py'
<Element 'title' >
PS C:UsersTerryAnaconda Scripts>
But when I try to extract the actual content of the title block using print(match[0].html), I expect to get output as follows:
Test - A Sample Website
…instead the output doesn’t stop at the closing tag, but prints the remaining contents of the file starting from the title tag:
<title>Test - A Sample Website</title>
<meta charset="utf-8"/>
<link rel="stylesheet" href="css/normalize.css"/>
<link rel="stylesheet" href="css/main.css"/>
<body>
…etc etc etc… all the way to the end.
I got the code from a tutorial on YouTube, and the video comments don’t indicate that anyone else has ever had the same issue, which is why I am asking here.
Any help would be appreciated.
Thanks
2 Answers
It looks as though the problem lies within my Anaconda environment under VS Code. The Python script itself works fine in my Linux environment, so I am closing this question now.
The html attribute in requests_html returns the HTML representation of the element, including its children. Because the title tag is part of the <head> element, calling match[0].html returns the complete contents of the element, not just the <title>. Instead of using .html, use the .text property to extract only the text content of the <title> tag. The text attribute returns the element’s text content, which is what you want.
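If the result still looks wrong in one environment, a cross-check that does not involve requests_html at all can help show whether the file or the library setup is at fault. The following is only a minimal sketch using Python's standard-library html.parser, which is a different technique from the tutorial's. The names TitleExtractor and extract_title and the shortened inline sample are made up for illustration.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect only the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False   # True while the parser is inside <title>
        self.title_parts = []    # text chunks seen inside <title>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title_parts.append(data)


def extract_title(source):
    """Return the stripped title text, or an empty string if there is none."""
    parser = TitleExtractor()
    parser.feed(source)
    parser.close()
    return "".join(parser.title_parts).strip()


# Shortened stand-in for simple.html (hypothetical sample, not the full file).
sample = """<!doctype html>
<html class="no-js" lang="">
<head>
<title>Test - A Sample Website</title>
<meta charset="utf-8">
</head>
<body>
<h1 id='site_title'>Test Website</h1>
</body>
</html>"""

print(extract_title(sample))  # Test - A Sample Website
```

To run it against the real file, read simple.html as in the question and pass the resulting string to extract_title(source). If this prints only the title text while match[0].text or match[0].html does not, that points toward the installed parsing libraries rather than the HTML file.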