
I’m trying to parse HTML using Beautiful Soup. I make GET requests with the requests module in Python and then convert the HTML to a BeautifulSoup object. But I ran into a problem: when I create the BeautifulSoup object, it changes the source code incorrectly. Comparing the soup object with response.text confirms that the change comes from Beautiful Soup. For example, it moves the closing table tag.

Here is the code:

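import requests
from bs4 import BeautifulSoup

# address holds the URL of the page being fetched
page = requests.get(address)
soup = BeautifulSoup(page.text, 'html.parser')

# Print the raw response, then the parsed tree, to compare them
print(page.text)
print(soup)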

The output is (page.text first, then soup):

<table>
  <tr>
    <td>
      some tags
    </td>
  <tr>
    <td>some tags
    </td>
  </tr>
</table>


<table>
      <tr>
        <td>
          some tags
        </td>
</table>
      <tr>
        <td>some tags
        </td>
      </tr>

Because of these changes I can’t parse the table correctly: in the bs object the table tag closes before all the tr tags are closed. What should I do to resolve this problem?

Here is the full code of the page I’m trying to parse:

<!DOCTYPE html>

<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width; initial-scale=1; maximum-scale=1; user-scalable=0;">
    <title>
        три версии нас</title>
    <meta name="Description" content="Онлайн библиотека Траума ePub + FB2 + Mobi. Скачать книги для электронных читалок, iOS и Android устройств и Amazon Kindle">
    <meta name="Keywords" content="Traum Library, traumlib, traum, Библиотека, Библиотека Траума, онлайн, Траум, Траума, скачать, бесплатно, электронные, книги, epub, FB2, Mobi, iBooks, Kindle, online, library, Traum, books, ebooks">
    <link rel="icon" href="/favicon.ico" type="image/x-icon">
    <link rel="apple-touch-icon" href="/apple-touch-icon.png">
    <script src="/likely.js"></script>
    <link rel="stylesheet" href="/likely.css">
    <style type="text/css">
        * {
            margin: 0;
            padding: 0;
        }

        html {
            height: 100%;
            font-size: 100%;
        }

        body {
            font-size: 0.625em;
            position: relative;
            min-height: 100%;
            margin: 0;
            background-image: url(/back.png);
        }

        #book_header {
            padding-top: 5px;
            font-weight: bold;
            margin-left: 10px;
            margin-bottom: -5px;
            color: #394263;
        }

        header {
            border-bottom-style: solid;
            border-bottom-width: 1px;
            border-bottom-color: grey;
        }

        header {
            overflow: hidden;
            height: 68px;
            width: 100%;
            position: fixed;
            left: 0px;
            top: 0px;
            background-color: rgb(248, 248, 248);
        }

        #content {
            margin: 0 auto;
            padding-top: 70px;
            padding-bottom: 70px;
        }

        footer {
            position: absolute;
            bottom: 0px;
            width: 100%;
        }

        .books p {
            margin-top: 15px;
        }

        .books a {
            position: absolute;
            left: 70px;
        }

        p,
        table,
        td {
            font: normal 1.2em sans-serif;
            margin: 10px;
            max-width: 768px;
        }

        h2 {
            font: normal 1.4em sans-serif;
            margin: 10px;
        }

        td,
        tr {
            padding: 4px;
        }

        a:link,
        a:visited,
        tr {
            color: #394263;
            text-decoration: none;
            border-bottom: 1px solid;
            border-color: #b2ccf0;
        }

        a:hover,
        tr:hover {
            color: #CC0000;
            border-color: #f0b2b2
        }

        .pluso a {
            border-bottom: none
        }

        ;
        tbody tr:hover {
            background: RGB(235, 235, 235);
        }

        form {
            display: inline-block;
            margin-top: 10px;
        }

        td:first-child + td {
            color: gray;
        }

        .speech {
            border: 1px solid #DDD;
            width: 300px;
            padding: 0;
            margin: 0
        }

        .speech input {
            border: 0;
            width: 240px;
            display: inline-block;
            height: 30px;
        }

        .speech img {
            float: right;
            width: 40px
        }
    </style>
    <script>
        if (document.cookie == '') {
            var date = new Date(2030, 1, 1);
            document.cookie = "book-format=mobi; path=/; expires=" + date.toUTCString();
        }
    </script>
</head>

<body OnLoad="document.search.find.focus();">
    <header>
        <h2><a href=/>Библиотека</a> Траума 2.33 &nbsp;Формат:&nbsp;
<select id='format-select' name='format-select'> 
<option value='epub'>ePub</option><option value='fb2'>FB2</option><option selected value='mobi'>Mobi</option></select><br>
<form name=search action="/">
<input placeholder="&nbsp;поиск книги / автора" type="search" results="10" name="find" name="q" id="transcript" style="width:223px;height:23px;" autofocus="autofocus" autocomplete="on" spellcheck="true" value="">
<input type="submit" value="Найти!" style="width:65px;height:23px">
</form></header><div id=content>
<section><p>Найдено в русской библиотеке Траума: 0</p>
<table></table></section><p>Найдено в английской библиотеке Траума: 0</p>
<table></table><p>Найдено в библиотеке lib.rus.ec (<b>только FB2</b>): 1<br><font size=1>Поиск идет только при отсутствии результатов в TraumLib.</font></p>
<div id='librusec' style='display:block;'><table><tr><td><div id=book_header><a href="/?find=Барнетт Лора">Барнетт Лора</a></td><td>&nbsp;</td><td>&nbsp;</td></div><tr><td><a href="/d.php?file=653860&name=Барнетт - Три версии нас.fb2"">Барнетт - Три версии нас<td>1.47MB</td></tr>
</div></table></div><footer><font size=2>
<div class="likely" data-url="http://lib.it.cx" data-title="Библиотека Траума 2.33 ePub/FB2/Mobi" style="margin:10px;">
  <div class="twitter"></div>
  <div class="facebook"></div>
  <div class="gplus"></div>
  <div class="vkontakte"></div>
  <div class="telegram"></div>
  <div class="odnoklassniki"></div>
  <div class="linkedin"></div>
  <div class="whatsapp"></div>
  <div class="pinterest" data-media="i/pinnable.jpg"></div>
</div>
<pre style="margin:10px;">

Найдено в БД за 0.0159 с</div>
</footer>
</body>
<script>
var sel = document.getElementById("format-select");
sel.onchange = function() {
    var date = new Date( 2030,1,1 );
    var format = sel.value;
    document.cookie="book-format=" + format + "; path=/; expires="+date.toUTCString();
    location.reload();
};
</script>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-49300233-1', 'auto');
  ga('send', 'pageview');
</script>
</html>

2 Answers


  1. Chosen as BEST ANSWER

    I resolved the problem by switching from 'html.parser' to 'lxml'. But if somebody knows why html.parser didn't work correctly, I would appreciate it if you shared.
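
The thread does not confirm why html.parser behaves differently, so take this as a sketch rather than an explanation. The full page above has a `</div>` that arrives after the cell it was opened in has already been closed, and each parser repairs that kind of markup in its own way. The snippet below parses one fragment with every installed parser and counts the rows that end up inside the table. `SAMPLE`, `rows_inside_table` and `compare_parsers` are hypothetical names made up for this illustration, and the fragment is a simplified stand-in for the real page, not a copy of it.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical fragment modelled on the page above: a <div> is opened inside
# a cell, but its </div> only arrives after the cell has been closed.
SAMPLE = (
    '<div id="wrap"><table>'
    '<tr><td><div id="hdr">A</td></div>'
    '<tr><td>B</td></tr>'
    '</table></div>'
)


def rows_inside_table(html, parser):
    """Return how many <tr> elements end up inside the first <table>."""
    soup = BeautifulSoup(html, parser)
    table = soup.find("table")
    return len(table.find_all("tr")) if table is not None else 0


def compare_parsers(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each installed parser name to its row count; skip missing ones."""
    counts = {}
    for name in parsers:
        try:
            counts[name] = rows_inside_table(html, name)
        except FeatureNotFound:
            continue  # this parser library is not installed
    return counts


if __name__ == "__main__":
    for name, count in compare_parsers(SAMPLE).items():
        print(f"{name}: {count} row(s) inside the table")
```

With html.parser I would expect only the first row to stay inside the table, because the late `</div>` gets matched against the outer div and closes the table along the way. If lxml is installed, its count can be compared directly, which is a quick way to check whether switching parsers helps for a given page.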


  2. page.text is not properly formatted: the first tr element is never closed

    <table>
      <tr>
        <td>
          some tags
        </td>
      <tr>             <------------ first <tr> is not closed with </tr>
        <td>some tags
        </td>
      </tr>
    </table>
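
To find this kind of problem without reading the markup by eye, a rough check can be built on the standard library's `HTMLParser`. This is only a heuristic sketch, not a validator: HTML allows some end tags to be omitted, so a reported tag is not necessarily an error. `TagBalanceChecker` and `find_unbalanced_tags` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Record start tags left open and end tags that have no opener."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, (line, column)) for every open element
        self.problems = []   # human-readable findings

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, self.getpos()))

    def handle_endtag(self, tag):
        if tag not in [name for name, _ in self.stack]:
            self.problems.append(f"stray </{tag}> at {self.getpos()}")
            return
        # Everything opened after the matching start tag was never closed.
        while self.stack:
            name, pos = self.stack.pop()
            if name == tag:
                break
            self.problems.append(f"<{name}> opened at {pos} is never closed")


def find_unbalanced_tags(html):
    """Return a list of findings for the given markup."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    # Whatever is still open at the end of input was never closed either.
    for name, pos in checker.stack:
        checker.problems.append(f"<{name}> opened at {pos} is never closed")
    return checker.problems


if __name__ == "__main__":
    fragment = """<table>
  <tr>
    <td>
      some tags
    </td>
  <tr>
    <td>some tags
    </td>
  </tr>
</table>"""
    for finding in find_unbalanced_tags(fragment):
        print(finding)
```

Run on the table fragment from the question, this should flag the first `<tr>` as never closed. Running it on the full page text (for example `find_unbalanced_tags(page.text)`) would list every such spot.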
    