I’m trying to parse HTML with Beautiful Soup. I fetch the page with a GET request via the requests module in Python and then build a BeautifulSoup object from the HTML. The problem is that the BeautifulSoup object alters the source incorrectly. I’m sure Beautiful Soup is responsible, because the soup object differs from response.text. For example, it moves the closing table tag.
Here is the code:
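import requests
from bs4 import BeautifulSoup

page = requests.get(address)  # address holds the URL of the page
soup = BeautifulSoup(page.text, 'html.parser')
print(page.text)  # raw HTML as received
print(soup)       # HTML as rebuilt by Beautiful Soup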
The output is (page.text first, then soup):
<table>
<tr>
<td>
some tags
</td>
<tr>
<td>some tags
</td>
</tr>
</table>
<table>
<tr>
<td>
some tags
</td>
</table>
<tr>
<td>some tags
</td>
</tr>
Because of these changes I can’t parse the table correctly: in the soup object the table tag closes before all the tr tags are closed. What should I do to resolve this?
Here is the full code of the page I’m trying to parse:
<!DOCTYPE html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width; initial-scale=1; maximum-scale=1; user-scalable=0;">
<title>
три версии нас</title>
<meta name="Description" content="Онлайн библиотека Траума ePub + FB2 + Mobi. Скачать книги для электронных читалок, iOS и Android устройств и Amazon Kindle">
<meta name="Keywords" content="Traum Library, traumlib, traum, Библиотека, Библиотека Траума, онлайн, Траум, Траума, скачать, бесплатно, электронные, книги, epub, FB2, Mobi, iBooks, Kindle, online, library, Traum, books, ebooks">
<link rel="icon" href="/favicon.ico" type="image/x-icon">
<link rel="apple-touch-icon" href="/apple-touch-icon.png">
<script src="/likely.js"></script>
<link rel="stylesheet" href="/likely.css">
<style type="text/css">
* {
margin: 0;
padding: 0;
}
html {
height: 100%;
font-size: 100%;
}
body {
font-size: 0.625em;
position: relative;
min-height: 100%;
margin: 0;
background-image: url(/back.png);
}
#book_header {
padding-top: 5px;
font-weight: bold;
margin-left: 10px;
margin-bottom: -5px;
color: #394263;
}
header {
border-bottom-style: solid;
border-bottom-width: 1px;
border-bottom-color: grey;
}
header {
overflow: hidden;
height: 68px;
width: 100%;
position: fixed;
left: 0px;
top: 0px;
background-color: rgb(248, 248, 248);
}
#content {
margin: 0 auto;
padding-top: 70px;
padding-bottom: 70px;
}
footer {
position: absolute;
bottom: 0px;
width: 100%;
}
.books p {
margin-top: 15px;
}
.books a {
position: absolute;
left: 70px;
}
p,
table,
td {
font: normal 1.2em sans-serif;
margin: 10px;
max-width: 768px;
}
h2 {
font: normal 1.4em sans-serif;
margin: 10px;
}
td,
tr {
padding: 4px;
}
a:link,
a:visited,
tr {
color: #394263;
text-decoration: none;
border-bottom: 1px solid;
border-color: #b2ccf0;
}
a:hover,
tr:hover {
color: #CC0000;
border-color: #f0b2b2
}
.pluso a {
border-bottom: none
}
;
tbody tr:hover {
background: RGB(235, 235, 235);
}
form {
display: inline-block;
margin-top: 10px;
}
td:first-child + td {
color: gray;
}
.speech {
border: 1px solid #DDD;
width: 300px;
padding: 0;
margin: 0
}
.speech input {
border: 0;
width: 240px;
display: inline-block;
height: 30px;
}
.speech img {
float: right;
width: 40px
}
</style>
<script>
if (document.cookie == '') {
var date = new Date(2030, 1, 1);
document.cookie = "book-format=mobi; path=/; expires=" + date.toUTCString();
}
</script>
</head>
<body OnLoad="document.search.find.focus();">
<header>
<h2><a href=/>Библиотека</a> Траума 2.33 Формат:
<select id='format-select' name='format-select'>
<option value='epub'>ePub</option><option value='fb2'>FB2</option><option selected value='mobi'>Mobi</option></select><br>
<form name=search action="/">
<input placeholder=" поиск книги / автора" type="search" results="10" name="find" name="q" id="transcript" style="width:223px;height:23px;" autofocus="autofocus" autocomplete="on" spellcheck="true" value="">
<input type="submit" value="Найти!" style="width:65px;height:23px">
</form></header><div id=content>
<section><p>Найдено в русской библиотеке Траума: 0</p>
<table></table></section><p>Найдено в английской библиотеке Траума: 0</p>
<table></table><p>Найдено в библиотеке lib.rus.ec (<b>только FB2</b>): 1<br><font size=1>Поиск идет только при отсутствии результатов в TraumLib.</font></p>
<div id='librusec' style='display:block;'><table><tr><td><div id=book_header><a href="/?find=Барнетт Лора">Барнетт Лора</a></td><td> </td><td> </td></div><tr><td><a href="/d.php?file=653860&name=Барнетт - Три версии нас.fb2"">Барнетт - Три версии нас<td>1.47MB</td></tr>
</div></table></div><footer><font size=2>
<div class="likely" data-url="http://lib.it.cx" data-title="Библиотека Траума 2.33 ePub/FB2/Mobi" style="margin:10px;">
<div class="twitter"></div>
<div class="facebook"></div>
<div class="gplus"></div>
<div class="vkontakte"></div>
<div class="telegram"></div>
<div class="odnoklassniki"></div>
<div class="linkedin"></div>
<div class="whatsapp"></div>
<div class="pinterest" data-media="i/pinnable.jpg"></div>
</div>
<pre style="margin:10px;">
Найдено в БД за 0.0159 с</div>
</footer>
</body>
<script>
var sel = document.getElementById("format-select");
sel.onchange = function() {
var date = new Date( 2030,1,1 );
var format = sel.value;
document.cookie="book-format=" + format + "; path=/; expires="+date.toUTCString();
location.reload();
};
</script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-49300233-1', 'auto');
ga('send', 'pageview');
</script>
</html>
2 Answers
I resolved the problem by switching from 'html.parser' to 'lxml'. If anyone knows why html.parser did not work correctly, I would appreciate an explanation.
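To compare the two parsers on a small case, here is a minimal sketch. The MARKUP string is a reduced, hypothetical version of the librusec table from the question (placeholder links and text), keeping the stray </div> that follows the first row in the page source. count_rows is a helper name made up for this example. It assumes bs4 is installed and skips any parser that is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Reduced, hypothetical version of the table in the question.
# Note the stray </div> after the first row and the unclosed first <tr>.
MARKUP = (
    '<div id="librusec"><table>'
    '<tr><td><div id="book_header"><a href="#">Author</a></td><td> </td></div>'
    '<tr><td><a href="#">Title</a><td>1.47MB</td></tr>'
    '</div></table></div>'
)


def count_rows(markup, parser):
    """Return (rows inside the first <table>, rows anywhere in the document)."""
    soup = BeautifulSoup(markup, parser)
    table = soup.find("table")
    inside = len(table.find_all("tr")) if table is not None else 0
    total = len(soup.find_all("tr"))
    return inside, total


for parser in ("html.parser", "lxml"):
    try:
        inside, total = count_rows(MARKUP, parser)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed.
        print(f"{parser}: not installed, skipped")
        continue
    print(f"{parser}: {inside} of {total} rows are inside the first table")
```

With html.parser, the first table ends up holding only one of the two rows. As far as I can tell, its tree builder treats the stray </div> as closing the nearest open div, which here is the one wrapping the table, so the table is closed along with it. Parsers differ in how they repair invalid markup, which is why switching the parser changes the result.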
page.text is not properly formatted: the first tr element is never closed.
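One way to spot this kind of problem before parsing is a simple nesting check. The sketch below uses only the standard library; TagBalanceChecker and check are names invented for this example, and the check is deliberately stricter than the HTML specification, which allows some end tags (such as </tr> and </td>) to be omitted. Treat its output as hints about where a parser may have to guess, not as a validity verdict.

```python
from html.parser import HTMLParser

# Elements that never have an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Report start tags without a matching end tag and stray end tags."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag name, line number) of open elements
        self.problems = []   # human-readable findings

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, self.getpos()[0]))

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> opens nothing.
        pass

    def handle_endtag(self, tag):
        if tag not in [name for name, _ in self.stack]:
            self.problems.append(f"line {self.getpos()[0]}: stray </{tag}>")
            return
        # Anything opened after the matching start tag was left unclosed.
        while self.stack:
            name, line = self.stack.pop()
            if name == tag:
                break
            self.problems.append(f"line {line}: <{name}> never closed")


def check(markup):
    """Return a list of nesting problems found in markup."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    # Elements still open at the end of the input were never closed.
    for name, line in checker.stack:
        checker.problems.append(f"line {line}: <{name}> never closed")
    return checker.problems


# The simplified table from the question.
SAMPLE = ("<table>\n<tr>\n<td>\nsome tags\n</td>\n<tr>\n"
          "<td>some tags\n</td>\n</tr>\n</table>")
for problem in check(SAMPLE):
    print(problem)
```

On the simplified table from the question it reports the first <tr> as never closed. Running check(page.text) on the full page would be the analogous step for the real response.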