
I am new to Python and am having trouble with my scraping code.
The script opens the website and gets past the cookie prompt.
However, it does not capture the complete HTML.

This is the complete HTML of the element on the website:

<div class="index__factor__Mo6xW p-base-regular">
  <h4 class="index__title__Rq0Po">Arbeitsatmosphäre</h4>
  <div class="index__block__7hodp index__scoreBlock__KZCPC">
    <span class="index__stars__nfK6S index__medium__CyRQn index__stars__bpFJl" data-fillcolor="butterscotch" data-score="5"></span>
  </div>
  <p class="index__plainText__JgbHE">Dynamisch</p>
</div>

And this is the code which is extracted:

<div class="index__factor__Mo6xW p-base-regular">
  <h4 class="index__title__Rq0Po">Work Atmosphere</h4>
  <p class="index__plainText__JgbHE">Dynamic</p>
</div> 

This is the code I have tried so far:

url = "https://www.kununu.com/de/adidas/kommentare"
driver = webdriver.Chrome()
driver.get(url)

[...]

show_more_reviews(driver, 5)  # clicks on "Read more Reviews"
make_mini_scores_visible(driver)  # shows all "Mini Scores" like "Arbeitsatmosphäre"

html = driver.execute_script("return document.documentElement.innerHTML;")
soup = BeautifulSoup(html, 'html.parser')

I need the complete HTML to be extracted, because I need every piece of information in it.

Thank you in advance!

2 Answers


  1. from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from bs4 import BeautifulSoup

    # ... the rest of your code stays the same ...

    soup = BeautifulSoup(html, 'html.parser')
    
    work_atmosphere_divs = soup.find_all('div', class_='index__factor__Mo6xW p-base-regular')
    for div in work_atmosphere_divs:
        title = div.find('h4', class_='index__title__Rq0Po').text.strip()
        atmosphere = div.find('p', class_='index__plainText__JgbHE').text.strip()
        print(f"Title: {title}, Atmosphere: {atmosphere}")
    
    driver.quit()
    

    You can try this; it might help.
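
The question asks about the star rating, which sits in the `data-score` attribute of the `<span>` rather than in any text node. A minimal sketch of reading it with BeautifulSoup, run against the markup pasted in the question, could look like this. The function name `extract_factors` is made up for illustration, and the sketch only helps if the `<span>` is actually present in the HTML you pass in; the question reports that it is missing from the extracted source.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (class names as shown there).
html = """
<div class="index__factor__Mo6xW p-base-regular">
  <h4 class="index__title__Rq0Po">Arbeitsatmosphäre</h4>
  <div class="index__block__7hodp index__scoreBlock__KZCPC">
    <span class="index__stars__nfK6S index__medium__CyRQn index__stars__bpFJl" data-fillcolor="butterscotch" data-score="5"></span>
  </div>
  <p class="index__plainText__JgbHE">Dynamisch</p>
</div>
"""


def extract_factors(html):
    """Hypothetical helper: return title, score and text for each factor block."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.select("div.index__factor__Mo6xW"):
        title = div.find("h4")
        stars = div.select_one("span[data-score]")  # the score is an attribute
        text = div.find("p")
        results.append({
            "title": title.get_text(strip=True) if title else None,
            "score": stars["data-score"] if stars else None,
            "text": text.get_text(strip=True) if text else None,
        })
    return results


factors = extract_factors(html)
print(factors)
```

Each field falls back to `None` when its element is absent, so a block without the star `<span>` does not raise an exception.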

  2. The data you see on the page is stored inside a <script> element in JSON form, so you can use that:

    import json
    
    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.kununu.com/de/adidas/kommentare"
    
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    data = json.loads(soup.select_one("#__NEXT_DATA__").text)
    
    # print(json.dumps(data, indent=4))
    
    for r in data["props"]["initialReduxState"]["reviews"]["reviews"]:
        print(r["title"], r["score"])
        for rr in r["ratings"]:
            print(rr["id"], rr["score"], rr["text"])
        print()
    

    Prints:

    
    ...
    
    Noch weit weg von einer Verbesserung ///. Nur Show bei Adidas. 2.5
    atmosphere 3 Viel Druck. Performance System heiß MyBest. Die Vorgesetzten bekommen von der Personalabteilung und den VPs gesagt wie sie zu bewerten haben und niemand traut sich dagegen vorzugehen.<br/>Mitarbeitende werden nicht wertgeschätzt.
    image 4 Noch ok, aber überbewertet.
    career 2 Onlinekurse ohne Strategie
    salary 2 Nicht transparent und auch bei Jobangeboten keine Gehaltsinformationen.<br/>Sozialleistungen nur Standard.<br/>VWL 40.- Pro Monat
    oldColleagues 1 Erfahrene Mitarbeiterende verschwinden still und leise....
    leadership 1 Kaum Kompetenzen in der Mitarbeiterführung. Das zeigen auch die negativen Rückmeldungen der sich wiederholenden Mitarbeiterbefragungen. Fragt man nach dem Ergebnis des NPS.
    equality 3 Noch immer keine offenen Gehaltsinformationen nach Stellen und Positionen.<br/>Nasenfaktor zählt nur.
    workLife 3 None
    environment 3 None
    teamwork 3 None
    workConditions 3 None
    communication 2 None
    tasks 3 None
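
The printed output above shows two things worth handling before storing the data: some texts contain literal `<br/>` tags, and some ratings have no text at all (`None`). Here is a rough sketch of flattening the ratings into rows while cleaning both. It runs against a small made-up page, so the review content and the helper name `flatten_ratings` are assumptions; only the key path is taken from the answer.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, trimmed stand-in for the real page. Only the key path
# used in the answer (props -> initialReduxState -> reviews -> reviews)
# is mirrored; the review content is invented.
sample_page = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"initialReduxState": {"reviews": {"reviews": [
  {"title": "Example review", "score": 2.5, "ratings": [
    {"id": "atmosphere", "score": 3, "text": "Line one.<br/>Line two."},
    {"id": "workLife", "score": 3, "text": null}
  ]}
]}}}}
</script>
</body></html>
"""


def flatten_ratings(page_html):
    """Hypothetical helper: one (review title, rating id, score, text) row per rating."""
    soup = BeautifulSoup(page_html, "html.parser")
    data = json.loads(soup.select_one("#__NEXT_DATA__").string)
    rows = []
    for review in data["props"]["initialReduxState"]["reviews"]["reviews"]:
        for rating in review["ratings"]:
            # Missing text becomes "", and <br/> tags become spaces.
            text = (rating["text"] or "").replace("<br/>", " ")
            rows.append((review["title"], rating["id"], rating["score"], text))
    return rows


rows = flatten_ratings(sample_page)
for row in rows:
    print(row)
```

Rows in this shape can be passed straight to `csv.writer` or a DataFrame constructor if you want a table of all mini scores.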
    