
I am trying to extract the content on the right-hand side of this page:

https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=idn%3D1173921214


Looking at the HTML, the information is stored in a table.

With my code snippet, I can't reach the text I want.

import requests
from bs4 import BeautifulSoup

def getDescriptionDNB():
    description = 'https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=9783125466302'
    response = requests.get(description)
    soupedDescription = BeautifulSoup(response.content, "html.parser")
    text = soupedDescription.find(class_="amount").text
    # Only continue when the search returned exactly one hit
    if text == "Treffer 1 von 1":
        autor = soupedDescription.find_all("tr")
        for i in autor:
            # Prints the first <td> found after each <tr>, not the value cell I need
            test = i.findNext("td").text
            print(test)

The problem is that I don't know how to get down to the inner <td> tag to get the information I want.

Do you know how I can solve this problem?

2 Answers


  1. The main issue is that the page's HTML is broken: some tr elements have no td and no closing tag.

    Try to select your elements more specifically, or store the information in a dict and pick values by key.

    Create a dict with CSS selectors:

    ...
    dict(
        row.get_text(':',strip=True).split(':',1) 
        for row in soup.select('tr:has(td:not([colspan]))')
    )
    

    Create a dict with pandas.read_html():

    import pandas as pd
    url = 'https://portal.dnb.de/opac.htm?method=simpleSearch&cqlMode=true&query=9783125466302'
    pd.read_html(url)[0].dropna().set_index(0)[1].to_dict()
    
    Output, based on the URL in your snippet:

    {'Link zu diesem Datensatz': 'https://d-nb.info/94985462X',
     'Titel': 'Learning English - Password red:Teil: Reformierte Rechtschreibung / 3. / [Hauptw.].',
     'Ausgabe': '1. Aufl., 1. Dr.',
     'Verlag': 'Stuttgart ; Düsseldorf ; Leipzig : Klett',
     'Zeitliche Einordnung': 'Erscheinungsdatum: 1997',
     'Umfang/Format': '172 S. ; 25 cm',
     'ISBN/Einband/Preis': '978-3-12-546630-2 Pp. : DM 29.60:3-12-546630-X Pp. : DM 29.60:3-12-54663-0 (falsch) Pp. : DM 29.60',
     'Sprache(n)': 'Englisch (eng), Deutsch (ger)',
     'Frankfurt': 'Signatur: 1997 A 10551:Bereitstellung  in Frankfurt',
     'Leipzig': 'Signatur: 1997 A 10551:Bereitstellung  in Leipzig'}
    
  2. As pointed out, you need to split the rows into key/value pairs. Sticking with BeautifulSoup (your tool of choice):

            teilen = i.find_all('td')
            if len(teilen) == 2:
                print(teilen[0].text.strip(), ' : ', teilen[1].text.strip())
    

    There are some other things you can improve yourself. Instead of selecting all the 'tr's in the document, first select the table:

    table id="fullRecordTable"
    

    and then move on to selecting the rows (‘tr’) in there.
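
Putting the two answers together, here is a minimal sketch of the idea: select the table by its id, walk its rows, and keep only rows with exactly two cells as key/value pairs. The `SAMPLE_HTML` fragment and the `parse_record` helper are hypothetical and only stand in for the real page, whose markup is messier. In practice you would pass `response.content` from the question's `requests.get` call instead.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the DNB detail page (not the real markup).
SAMPLE_HTML = """
<table id="otherTable">
  <tr><td>Ignored</td><td>row outside the record table</td></tr>
</table>
<table id="fullRecordTable">
  <tr><td colspan="2">Treffer 1 von 1</td></tr>
  <tr></tr>
  <tr><td><strong>Titel</strong></td><td> Learning English </td></tr>
  <tr><td><strong>Verlag</strong></td><td>Stuttgart : Klett</td></tr>
</table>
"""


def parse_record(html):
    """Return the label/value rows of the record table as a dict."""
    soup = BeautifulSoup(html, "html.parser")
    # Restrict the search to the record table instead of every <tr> on the page.
    table = soup.find("table", id="fullRecordTable")
    if table is None:
        return {}
    record = {}
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Keep only label/value rows; skip rows with no cells or a single cell.
        if len(cells) == 2:
            key = cells[0].get_text(strip=True)
            value = cells[1].get_text(" ", strip=True)
            record[key] = value
    return record


if __name__ == "__main__":
    print(parse_record(SAMPLE_HTML))
```

Returning a dict rather than printing makes it easy to pick a single field afterwards, for example `parse_record(html).get("Titel")`.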
