I’m trying to build a crawler for a Korean news website.
The strange thing is that I already have working code. Here is the example:

import requests
from bs4 import BeautifulSoup
import telegram

url = 'http://www.thelec.kr/news/articleList.html?page=1&total=3836&box_idxno=&view_type=sm'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html, 'html.parser')

search_result = soup.select_one('#user-container')
news_list = search_result.select('.article-veiw-body > .article-list > .article-list-content > .list-block > .list-titles >a')

contents = []
for news in news_list:
    link = news['href']
    title = news.text
    contents.append("http://www.thelec.kr"+link + " " + title)

contents

I changed only the URL and the selectors, like this:

import requests
from bs4 import BeautifulSoup
import telegram

url = 'https://news.daum.net/breakingnews/digital'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html, 'html.parser')

search_result = soup.select_one('#kakaoContent')
news_list = search_result.select('.box_etc > .cMain > .mArticle > .box_etc > .list_news2 > .cont_thumb > a')

links = []
for news in news_list:
    link = news['href']
    links.append(link)

links

Suddenly the result is ‘[]’, an empty list. I tried it on another website too, with the same empty result.
I don’t understand. Both scripts look the same. Why does one work and the other doesn’t?

2 Answers


  1. Your selector is too narrow: it requires an exact chain of direct children. Try descendant selectors instead:

    soup.select('#kakaoContent .box_etc .list_news2 .cont_thumb a')
    
  2. Your current second selector doesn’t work on the page for me. If you want the links to the articles on the left-hand side, you need to change your CSS selector, for example to this faster and more accurate one:

    .list_news2 .tit_thumb >  a
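
To see why both answers help, here is a minimal sketch of how the child combinator (`>`) differs from the descendant combinator (a space). The HTML below is hypothetical: it only reuses the class names from the question and is an assumption about the nesting, not a copy of the real Daum page, and the `example.com` link is a placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, NOT the real Daum page: it only mimics the class names
# from the question to show how the two combinators behave.
html = """
<div id="kakaoContent">
  <div class="box_etc">
    <ul class="list_news2">
      <li>
        <div class="cont_thumb">
          <strong class="tit_thumb">
            <a href="https://example.com/article/1">First headline</a>
          </strong>
        </div>
      </li>
    </ul>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Child combinator (>): every step must be a direct parent/child pair.
# In this markup .cont_thumb sits inside <li>, and <a> sits inside <strong>,
# so the strict chain matches nothing.
strict = soup.select("#kakaoContent .list_news2 > .cont_thumb > a")

# Descendant combinator (space): any depth of nesting is allowed.
loose = soup.select("#kakaoContent .list_news2 .cont_thumb a")

# The shorter selector from the second answer: <a> is a direct child of .tit_thumb.
short = soup.select(".list_news2 .tit_thumb > a")

links = [a["href"] for a in short]
print(len(strict), len(loose), links)
```

If a long `>` chain returns an empty list, shortening it one step at a time, or swapping `>` for a space, usually shows which step does not match the page's actual nesting.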
    