Problems with Python Web Scraping: Incomplete HTML Code Extraction

user23501892
February 29, 2024
86 views
0 votes
2 Answers

I am new to Python and am having trouble with my scraping code.
The script opens the website and gets past the cookie banner.
However, it does not copy the entire HTML code.

This is the complete HTML of the relevant element on the website:

<div class="index__factor__Mo6xW p-base-regular">
  <h4 class="index__title__Rq0Po">Arbeitsatmosphäre</h4>
  <div class="index__block__7hodp index__scoreBlock__KZCPC">
    <span class="index__stars__nfK6S index__medium__CyRQn index__stars__bpFJl" data-fillcolor="butterscotch" data-score="5"></span>
  </div>
  <p class="index__plainText__JgbHE">Dynamisch</p>
</div>

And this is what my script extracts:

<div class="index__factor__Mo6xW p-base-regular">
  <h4 class="index__title__Rq0Po">Work Atmosphere</h4>
  <p class="index__plainText__JgbHE">Dynamic</p>
</div>

This is the code I have tried so far:

url = "https://www.kununu.com/de/adidas/kommentare"
driver = webdriver.Chrome()
driver.get(url)

[...]

show_more_reviews(driver, 5)  # Clicks "Read more Reviews"
make_mini_scores_visible(driver)  # Shows all "Mini Scores" such as "Arbeitsatmosphäre"

html = driver.execute_script("return document.documentElement.innerHTML;")
soup = BeautifulSoup(html, 'html.parser')

It is important that the whole code is extracted since I need every piece of information.

Thank you in advance!

Answers

- TejinderSingh
- February 29, 2024 at 11:51 am
- 0 votes
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

The rest of your code remains the same.
```
soup = BeautifulSoup(html, 'html.parser')

work_atmosphere_divs = soup.find_all('div', class_='index__factor__Mo6xW p-base-regular')
for div in work_atmosphere_divs:
    title = div.find('h4', class_='index__title__Rq0Po').text.strip()
    atmosphere = div.find('p', class_='index__plainText__JgbHE').text.strip()
    print(f"Title: {title}, Atmosphere: {atmosphere}")

driver.quit()
```
You can try this; it might help.
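The snippet above reads the title and the text but not the star rating, which the question shows in the `data-score` attribute of the `<span>`. A minimal sketch of reading that attribute follows, run against the markup quoted in the question rather than the live site. The helper name `extract_factors` is hypothetical, and matching on the class prefix `index__factor__` assumes that the hashed suffixes (such as `Mo6xW`) may change between site builds. It only helps if the `<span>` is actually present in the HTML handed to BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Markup copied from the question (the full element, including the star span).
SAMPLE_HTML = """
<div class="index__factor__Mo6xW p-base-regular">
  <h4 class="index__title__Rq0Po">Arbeitsatmosphäre</h4>
  <div class="index__block__7hodp index__scoreBlock__KZCPC">
    <span class="index__stars__nfK6S index__medium__CyRQn index__stars__bpFJl" data-fillcolor="butterscotch" data-score="5"></span>
  </div>
  <p class="index__plainText__JgbHE">Dynamisch</p>
</div>
"""


def extract_factors(html):
    """Return one dict per factor block with its title, score and text."""
    soup = BeautifulSoup(html, "html.parser")
    factors = []
    # Assumption: match on the class prefix, not the full hashed class name.
    for div in soup.select('div[class*="index__factor__"]'):
        title = div.find("h4")
        star = div.select_one("span[data-score]")
        text = div.find("p")
        factors.append({
            "title": title.get_text(strip=True) if title else None,
            # None means the star span was missing from the extracted HTML.
            "score": star["data-score"] if star else None,
            "text": text.get_text(strip=True) if text else None,
        })
    return factors


if __name__ == "__main__":
    for factor in extract_factors(SAMPLE_HTML):
        print(factor)
```

A `score` of `None` in the result is a quick way to confirm that the span never made it into the extracted HTML, as opposed to being lost during parsing.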

The data you see on the page is stored inside a <script> element as JSON, so you can use that:

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.kununu.com/de/adidas/kommentare"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").text)

# print(json.dumps(data, indent=4))

for r in data["props"]["initialReduxState"]["reviews"]["reviews"]:
    print(r["title"], r["score"])
    for rr in r["ratings"]:
        print(rr["id"], rr["score"], rr["text"])
    print()

Prints:


...

Noch weit weg von einer Verbesserung ///. Nur Show bei Adidas. 2.5
atmosphere 3 Viel Druck. Performance System heiß MyBest. Die Vorgesetzten bekommen von der Personalabteilung und den VPs gesagt wie sie zu bewerten haben und niemand traut sich dagegen vorzugehen.<br/>Mitarbeitende werden nicht wertgeschätzt.
image 4 Noch ok, aber überbewertet.
career 2 Onlinekurse ohne Strategie
salary 2 Nicht transparent und auch bei Jobangeboten keine Gehaltsinformationen.<br/>Sozialleistungen nur Standard.<br/>VWL 40.- Pro Monat
oldColleagues 1 Erfahrene Mitarbeiterende verschwinden still und leise....
leadership 1 Kaum Kompetenzen in der Mitarbeiterführung. Das zeigen auch die negativen Rückmeldungen der sich wiederholenden Mitarbeiterbefragungen. Fragt man nach dem Ergebnis des NPS.
equality 3 Noch immer keine offenen Gehaltsinformationen nach Stellen und Positionen.<br/>Nasenfaktor zählt nur.
workLife 3 None
environment 3 None
teamwork 3 None
workConditions 3 None
communication 2 None
tasks 3 None
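The printed text still contains `<br/>` tags, and ratings without a comment show up as `None`. A minimal sketch of tidying both follows, assuming the JSON keeps the `props` / `initialReduxState` / `reviews` / `reviews` layout used in the loop above. The helpers `clean_text` and `iter_ratings` are hypothetical names, and the sample dictionary is a trimmed copy of the review printed above, not a live response.

```python
import re


def clean_text(text):
    """Replace <br/> tags with spaces; turn a missing comment into ''."""
    if text is None:
        return ""
    text = re.sub(r"<br\s*/?>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def iter_ratings(data):
    """Yield (review title, rating id, score, cleaned text) tuples."""
    # Assumption: same nesting as in the answer; missing keys yield nothing.
    reviews = (
        data.get("props", {})
        .get("initialReduxState", {})
        .get("reviews", {})
        .get("reviews", [])
    )
    for review in reviews:
        for rating in review.get("ratings", []):
            yield (
                review.get("title"),
                rating.get("id"),
                rating.get("score"),
                clean_text(rating.get("text")),
            )


# Trimmed from the output shown above.
sample = {
    "props": {"initialReduxState": {"reviews": {"reviews": [{
        "title": "Noch weit weg von einer Verbesserung ///. Nur Show bei Adidas.",
        "score": 2.5,
        "ratings": [
            {"id": "career", "score": 2, "text": "Onlinekurse ohne Strategie"},
            {"id": "equality", "score": 3,
             "text": "Noch immer keine offenen Gehaltsinformationen nach "
                     "Stellen und Positionen.<br/>Nasenfaktor zählt nur."},
            {"id": "workLife", "score": 3, "text": None},
        ],
    }]}}}
}

if __name__ == "__main__":
    for row in iter_ratings(sample):
        print(row)
```

Using `.get()` with defaults means a changed or missing key produces an empty result instead of a `KeyError`, which can make layout changes on the site easier to notice and handle.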
