I have a web scraping project in which I am trying to scrape some data from a web page. I chose a site called wykop.pl, which is roughly a Polish Reddit.

My idea is that Selenium opens the page, accepts the cookies, closes the ad (it doesn't pop up every time), scrolls to the bottom of the page (optional; I don't think it's needed) and then clicks the next-page button, which it locates with a CSS selector.

This is my code:

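import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

website = "https://wykop.pl/hity/roku/strona/1"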

cookies_button_xpath = "//button[contains(@class,'qxOn2zvg e1sXLPUy')]"  # relative XPath of the accept-cookies button

service_chrome = Service(executable_path = chromepath) 
options_chrome = webdriver.ChromeOptions()
driver_chrome = webdriver.Chrome(service = service_chrome, options = options_chrome) # open Chrome

driver_chrome.maximize_window() # maximizes the browser window
driver_chrome.get(website) # opens the website

time.sleep(3) # the page can be slow to load, so wait a few seconds

content = driver_chrome.find_element('xpath',cookies_button_xpath) # finds the button
content.click() # clicks the button
# WORKS
#next_page_class_next = driver_chrome.find_element_by_css_selector("li.next")

# removed, now it has to be done like this

# a css selector to target the next page button with the class "next"
next_page_button_css_selector = 'next > a'

try:
    # Wait for the close button of the ad to be visible
    close_ad_button = WebDriverWait(driver_chrome, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button[data-v-6fdb93ea]")))
    
    # If the ad appears, close it
    close_ad_button.click()
except:
    # If the ad doesn't appear, carry on
    pass


# get us to the bottom of the page
driver_chrome.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# wait for the next page button to be clickable
next_page = WebDriverWait(driver_chrome, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, next_page_css_selector))).click()

This is the error:

---------------------------------------------------------------------------
TimeoutException                          Traceback (most recent call last)
Cell In[27], line 47
     45 driver_chrome.execute_script("window.scrollTo(0, document.body.scrollHeight);")
     46 # wait for the next page button to be clickable
---> 47 next_page = WebDriverWait(driver_chrome, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, next_page_css_selector))).click()

File ~\miniconda3\envs\Piotrus\Lib\site-packages\selenium\webdriver\support\wait.py:105, in WebDriverWait.until(self, method, message)
    103     if time.monotonic() > end_time:
    104         break
--> 105 raise TimeoutException(message, screen, stacktrace)

TimeoutException: Message: 

I have tried locating the button by XPath instead; the result is the same.

I have tried increasing the wait from 10 seconds to 30, 50 and 70. Nothing worked.

I have tried other variations of the CSS selector, such as

next_page_css_selector = "li.next > a"

That doesn't work either.

I know the problem is on my side, and I think I'm close, because the script does accept the cookies using the button I located by XPath.

I'd really appreciate it if you could try running the code and see what's wrong.

2 Answers


  1. To get the links from different pages, it is easier to use the site's Ajax pagination API, e.g.:

    import requests
    
    url = "https://wykop.pl/api/v3/hits/links"
    params = {"limit": "20", "page": "1", "sort": "year"}
    headers = {
        "Authorization": "Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VybmFtZSI6Inc1Mzk0NzI0MDc0OCIsInVzZXItaXAiOiIxMDUzMTU3MTQ5Iiwicm9sZXMiOlsiUk9MRV9BUFAiXSwiYXBwLWtleSI6Inc1Mzk0NzI0MDc0OCIsImV4cCI6MTcxNDUwMjA5MX0.X2mUIzvmz5FSskFRzuVYX37yAJU9aTlZqI56VqZCvWY"
    }
    
    for params["page"] in range(1, 3):  # <-- increase number of pages here
        data = requests.get(url, params=params, headers=headers).json()
        for d in data["data"]:
            print(
                d["votes"]["count"], d["title"], f'{d["votes"]["up"]}/{d["votes"]["down"]}'
            )
            print(d["source"]["url"])
            print()
    

    Prints:

    
    ...
    
    5037 Kiedy ekstradycja Sebastiana M. do Polski? 5057/20
    https://wykop.pl/artykul/7003275/kiedy-ekstradycja-sebastiana-m-do-polski
    
    5040 Deweloperzy lobbują, aby usunąć wymóg ilości miejsc parkingowych na mieszkanie 5048/8
    https://www.money.pl/gospodarka/zmiany-w-lex-deweloper-branza-parkingowy-wymog-musi-zniknac-7000188460038656a.html
    
    5027 TEDE vs PiSowscy, ale to jest piękne xD 5187/160
    https://www.threads.net/@lechuczechu/post/C1K9rbwv2dQ
    
    4966 Policjant wyrywa telefon kierowcy niszcząc jego własność, wypiera się, ale wszys 4988/22
    
    4900 Apel - administracjo zablokuj dodawanie FAME MMA 5272/372
    https://wykop.pl/link/7299981/darmowe-fame-mma-reborn-na-tym-dc-https-discord-gg-a5ranypbdv-darmowe-clout-mm
  2. The problem is that your CSS selector is not correct.

    next > a
    

    This looks for an A tag that is a direct child of an HTML tag named NEXT. There is no NEXT tag on the page, so nothing ever matches and the wait times out.

    The relevant HTML is

    <li ... class="next">
        <a ...>&gt;</a>
    </li>
    

    I think what you meant is

    li.next > a
    

    This looks for an A tag that is a direct child of an LI tag with the class 'next'. That matches the next-page link at the bottom of the page.
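
The difference between the two selectors can be checked offline. Python's standard library has no CSS selector engine, so the sketch below is only an approximation: it uses ElementTree's limited XPath support, where `.//next/a` stands in for `next > a` and `.//li[@class='next']/a` stands in for `li.next > a` (the second equivalence only holds while the class attribute is exactly `next`). The markup and the `href` values are made up for illustration and are not copied from wykop.pl.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup, modelled on the <li class="next"> snippet above.
PAGINATION = """
<ul>
  <li class="prev"><a href="/hity/roku/strona/1">&lt;</a></li>
  <li class="next"><a href="/hity/roku/strona/2">&gt;</a></li>
</ul>
""".strip()

root = ET.fromstring(PAGINATION)

# Rough equivalent of the CSS selector "next > a": an <a> inside a <next> element.
tag_matches = root.findall(".//next/a")

# Rough equivalent of "li.next > a": an <a> inside an <li> whose class is "next".
class_matches = root.findall(".//li[@class='next']/a")

print(len(tag_matches))                         # 0
print([a.get("href") for a in class_matches])   # ['/hity/roku/strona/2']
```

The first query finds nothing because no element is named `next`; the second finds the single next-page link.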


    Additional feedback:

    1. Instead of declaring a new WebDriverWait() instance each time you use it, create one and reuse it. For example,

      wait = WebDriverWait(driver, 10)
      wait.until(EC.element_to_be_clickable((...))).click()
      
    2. I would suggest that if you are only going to use a locator once, don’t bother declaring it as a variable, e.g. instead of

      next_page_button_css_selector = 'next > a'
      ...
      next_page = WebDriverWait(driver_chrome, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, next_page_css_selector))).click()
      

      just use

      wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "li.next > a"))).click()
      

      It keeps all your code together and makes it easier to read.

    3. If you need to declare a locator, declare a By instead of just a string. Instead of

      next_page_button_css_selector = 'next > a'
      

      use

      next_page_button_locator = (By.CSS_SELECTOR, 'li.next > a')
      

      That way the locator string and the type are declared in the same place making maintenance and reading/following the code easier.

    4. Don’t use time.sleep(). Instead add a WebDriverWait when you need a wait.

    5. As of Selenium 4.6, you no longer need to download and configure the driver yourself. Selenium Manager does that for you now. This

      service_chrome = Service(executable_path = chromepath) 
      options_chrome = webdriver.ChromeOptions()
      driver_chrome = webdriver.Chrome(service = service_chrome, options = options_chrome) # open Chrome
      

      turns into

      driver_chrome = webdriver.Chrome() # open Chrome
      
    6. This is just a personal preference… but name your driver driver, not driver_chrome. You aren’t maintaining multiple drivers of different types so there’s no point in putting ‘chrome’ in the name. It’s short, faster to type, etc. If you ever do change to Firefox or another browser, you’ll need to rename this variable to match, etc. Just keep it simple…
