Ubuntu - scrape the HTML page after clicking on a div tag using BeautifulSoup

Dinasour
December 20, 2024
207 views
0 votes
2 Answers

I ran into trouble scraping the questions and answers from this page:

https://tech12h.com/bai-hoc/trac-nghiem-lich-su-12-bai-1-su-hinh-thanh-trat-tu-gioi-moi-sau-chien-tranh-gioi-thu-hai

The problem is that the answers only appear after I click a div labelled Xem đáp án ("View answers"; scroll down to the end of the page). It is not a link, just a div, so I guess it uses a JavaScript event handler to render the content after the click.

How can I deal with this using BeautifulSoup? I ran into a driver conflict with Selenium when I tried it on Ubuntu.

Thank you

Answers

Chosen as BEST ANSWER
- Dinasour
- December 20, 2024 at 8:14 am
- 0 votes
I finally solved this issue!

I couldn't run the driver because of a version conflict between the Chrome app and chromedriver (the one located in /usr/local/bin).

I found this repository, which has a driver matching my Chrome app version: https://github.com/dreamshao/chromedriver

I downloaded the version that suits my setup and put it in the /usr/local/bin folder.
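To see whether the two versions really disagree, it can help to compare their major version numbers, since a chromedriver build is generally expected to match the major version of the Chrome it drives. Below is a minimal sketch of that check. The helper names major_version and versions_match are made up for illustration, and the version strings are invented samples in the shape that "google-chrome --version" and "chromedriver --version" usually print, not output from the poster's machine.

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version from text such as 'Google Chrome 131.0.6778.204'."""
    # Look for the first four-part dotted number and keep its first part.
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """Return True when Chrome and chromedriver share the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


if __name__ == "__main__":
    # Invented sample strings, for illustration only.
    chrome = "Google Chrome 131.0.6778.204"
    driver = "ChromeDriver 114.0.5735.90 (sample-build-info)"
    print(versions_match(chrome, driver))  # prints False (131 vs 114)
```

On Ubuntu the browser binary may have a different name (for example a Chromium build), so the command used to obtain the first string is an assumption to adjust.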

(Edit)

You shouldn’t need to download a Chrome driver explicitly. Selenium 4.6 and later ship with Selenium Manager, which locates or downloads a matching driver for you.

Hence, all you need is this:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement
from collections.abc import Iterable

OPTIONS = webdriver.ChromeOptions()
OPTIONS.add_argument("--headless=new")  # run Chrome without a visible window

URL = "https://tech12h.com/bai-hoc/trac-nghiem-lich-su-12-bai-1-su-hinh-thanh-trat-tu-gioi-moi-sau-chien-tranh-gioi-thu-hai"

CSS_P = "#accordionExample > p"
CSS_L = "#accordionExample > ul li"

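def etext(e: WebElement) -> str:
    """Return the element's text, falling back to textContent."""
    # .text only returns text that is visible on the page.
    if t := e.text:
        return t.strip()
    # textContent also covers text that is present but not rendered.
    p = e.get_property("textContent")
    if isinstance(p, str):
        return p.strip()
    return ""

def get_common(driver: webdriver.Chrome, css: str) -> Iterable[str]:
    """Wait up to 10 seconds for matches of css, then yield each element's text."""
    wait = WebDriverWait(driver, 10)
    ec = EC.presence_of_all_elements_located
    selector = By.CSS_SELECTOR, css
    yield from map(etext, wait.until(ec(selector)))

def chunks(items: list[str], chunk=4) -> Iterable[list[str]]:
    """Split items into consecutive groups of chunk entries (4 options per question)."""
    for i in range(0, len(items), chunk):
        yield items[i : i + chunk]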

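if __name__ == "__main__":
    # Pass the options by keyword; the first positional parameter differs between Selenium 4 releases.
    with webdriver.Chrome(options=OPTIONS) as driver:
        driver.get(URL)
        questions = list(get_common(driver, CSS_P))
        answers = list(get_common(driver, CSS_L))
        # Every question is expected to have exactly four answer options.
        assert len(questions) * 4 == len(answers)
        qanda = dict(zip(questions, chunks(answers)))
        for k, v in qanda.items():
            print(k)
            for a in v:
                print(f"\t{a}")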

Output (partial):

Câu 1: Để kết thúc nhanh chiến tranh ở châu Âu và châu Á - Thái Bình - Dương, ba cường quốc đã thống nhất mục đích gì?
        A. Sử dụng bom nguyên tử để tiêu diệt phát xít Nhật.
        B. Hồng quân Liên Xô nhanh chóng tấn công vào tận sào huyệt của phát xít Đức ở Bec-lin.
        C. Tiêu diệt tận gốc chủ nghĩa phát xít Đức và quân phiệt Nhật.
        D. Tất cả các mục đích trên.
Câu 2: Sự kiện nào dẫn đến thành lập nước Cộng hòa Liên bang Đức?
        A. Nước Đức được hòan toàn thống nhất.
        B. Nước Đức đã tiêu diệt tận gốc chủ nghĩa phát xít.
        C. Mĩ, Anh, Pháp hợp nhất các vùng chiếm đóng.
        D. Tất cả các sự kiện trên.
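
The script above reads the text without clicking anything. If a page only adds the answers to the DOM after the click, a sketch along the following lines might help, with the rendered HTML then passed on to BeautifulSoup as the question asked. It assumes the div's text is exactly Xem đáp án, as described in the question, which has not been verified against the page. The helper names div_xpath and html_after_click are made up for illustration, and the 10-second timeout is an arbitrary choice.

```python
def div_xpath(label: str) -> str:
    """Build an XPath that matches a <div> whose trimmed text equals label."""
    # Assumption: label contains no double quotes.
    return f'//div[normalize-space()="{label}"]'


def html_after_click(url: str, label: str, timeout: float = 10) -> str:
    """Open url, click the div labelled label, and return the rendered HTML."""
    # Imported here so div_xpath stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    with webdriver.Chrome(options=options) as driver:
        driver.get(url)
        locator = (By.XPATH, div_xpath(label))
        # Wait until the div is present, visible and enabled, then click it.
        button = WebDriverWait(driver, timeout).until(
            EC.element_to_be_clickable(locator)
        )
        button.click()
        # page_source reflects the DOM at the moment it is read.
        return driver.page_source
```

The returned string can then be parsed with BeautifulSoup(html, "html.parser") and queried as usual. If the content arrives asynchronously after the click, an extra wait for the answer elements before reading page_source may be needed.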
