Html - How to scrape element with BeautifulSoup out of a table?

BenediktFaude
November 8, 2023
105 views
1 vote
2 Answers

I am trying to extract the content on the right side of this page:

https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=idn%3D1173921214

Looking at the HTML, the information is stored in a table.

With my code snippet, I can't reach the text I want.

import requests
from bs4 import BeautifulSoup

def getDescriptionDNB():
    description = 'https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=9783125466302'
    response = requests.get(description)
    soupedDescription = BeautifulSoup(response.content, "html.parser")
    text = soupedDescription.find(class_="amount").text
    if text == "Treffer 1 von 1":
        autor = soupedDescription.find_all("tr")
        for i in autor:
            # prints only the first <td> found after each <tr>, not the value cell
            test = i.find_next("td").text
            print(test)

The problem is that I don't know how to get down to the inner <td> tag to get the information I want.

Do you know how I can solve this problem?

Answers

The main issue is that the HTML of the page is broken: some tr elements have no td and no closing tag.

Try to select your elements more specifically, or store the information in a dict and pick values by key.

Create a dict with css selectors:

...
dict(
    row.get_text(':',strip=True).split(':',1) 
    for row in soup.select('tr:has(td:not([colspan]))')
)
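To see what that selector does without hitting the network, here is a minimal sketch that runs the same expression on a hand-written fragment. The sample markup is an assumption for illustration, not the real DNB page; only the selector and the `get_text`/`split` expression come from the snippet above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, loosely modelled on a two-column record table.
SAMPLE_HTML = """
<table id="fullRecordTable">
  <tr><td colspan="2">Treffer 1 von 1</td></tr>
  <tr><td><strong>Titel</strong></td><td>Learning English</td></tr>
  <tr><td><strong>Verlag</strong></td><td>Stuttgart : Klett</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only rows that contain at least one td without a colspan attribute,
# join each row's text pieces with ':' and split on the first ':' only,
# so colons inside the value are preserved.
record = dict(
    row.get_text(":", strip=True).split(":", 1)
    for row in soup.select("tr:has(td:not([colspan]))")
)

print(record)
```

Note that `dict()` raises a `ValueError` if a matching row yields only one text piece, so on messier pages it may be safer to filter such rows out first.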

Create a dict with pandas.read_html():

import pandas as pd
url = 'https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=9783125466302'
pd.read_html(url)[0].dropna().set_index(0)[1].to_dict()

Output

Based on the URL in your snippet:

{'Link zu diesem Datensatz': 'https://d-nb.info/94985462X',
 'Titel': 'Learning English - Password red:Teil: Reformierte Rechtschreibung / 3. / [Hauptw.].',
 'Ausgabe': '1. Aufl., 1. Dr.',
 'Verlag': 'Stuttgart ; Düsseldorf ; Leipzig : Klett',
 'Zeitliche Einordnung': 'Erscheinungsdatum: 1997',
 'Umfang/Format': '172 S. ; 25 cm',
 'ISBN/Einband/Preis': '978-3-12-546630-2 Pp. : DM 29.60:3-12-546630-X Pp. : DM 29.60:3-12-54663-0 (falsch) Pp. : DM 29.60',
 'Sprache(n)': 'Englisch (eng), Deutsch (ger)',
 'Frankfurt': 'Signatur: 1997 A 10551:Bereitstellung  in Frankfurt',
 'Leipzig': 'Signatur: 1997 A 10551:Bereitstellung  in Leipzig'}

- JohnneyDarkness
- November 8, 2023 at 1:27 am
- 0 votes
You need to break apart the key/value pairs, as pointed out. Sticking with BeautifulSoup (your tool of choice):
```
        teilen = i.find_all('td')
        if len(teilen) == 2:
            print(teilen[0].text.strip(), ' : ', teilen[1].text.strip())
```
There are some other things you can improve on yourself. Instead of selecting all the 'tr's in the document, select the table first:
```
table id="fullRecordTable"
```
and then move on to selecting the rows ('tr') in there.
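Put together, the two suggestions might look like the sketch below. It assumes the record table carries `id="fullRecordTable"` as quoted above; the function name `rows_to_dict` and the sample markup are made up for illustration, and the real page may need extra handling because of its broken rows.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one unrelated table plus the record table.
SAMPLE_HTML = """
<table><tr><td>Navigation</td><td>Hilfe</td></tr></table>
<table id="fullRecordTable">
  <tr><td colspan="2">Treffer 1 von 1</td></tr>
  <tr><td><strong>Titel</strong></td><td>Learning English</td></tr>
  <tr><td><strong>Ausgabe</strong></td><td>1. Aufl., 1. Dr.</td></tr>
</table>
"""

def rows_to_dict(html):
    """Collect key/value pairs from two-cell rows of the record table."""
    soup = BeautifulSoup(html, "html.parser")
    # Narrow the search to the record table instead of every tr in the page.
    table = soup.find("table", id="fullRecordTable")
    if table is None:
        return {}
    result = {}
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Skip header rows and broken rows that do not have exactly two cells.
        if len(cells) == 2:
            key = cells[0].get_text(strip=True)
            result[key] = cells[1].get_text(" ", strip=True)
    return result

print(rows_to_dict(SAMPLE_HTML))
```

Scoping the search to the table keeps rows from other tables on the page out of the result, and the two-cell check skips rows that would otherwise break the key/value pairing.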