I’m trying to write a crawler for a Korean news website.
The weird thing is that I already have working code. Here is the example:
import requests
from bs4 import BeautifulSoup
import telegram
url = 'http://www.thelec.kr/news/articleList.html?page=1&total=3836&box_idxno=&view_type=sm'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
search_result = soup.select_one('#user-container')
news_list = search_result.select('.article-veiw-body > .article-list > .article-list-content > .list-block > .list-titles >a')
contents = []
for news in news_list:
    link = news['href']
    title = news.text
    contents.append("http://www.thelec.kr" + link + " " + title)
contents
I changed only the URL and the selector, like this:
import requests
from bs4 import BeautifulSoup
import telegram
url = 'https://news.daum.net/breakingnews/digital'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
search_result = soup.select_one('#kakaoContent')
news_list = search_result.select('.box_etc > .cMain > .mArticle > .box_etc > .list_news2 > .cont_thumb > a')
links = []
for news in news_list:
    link = news['href']
    links.append(link)
links
All of a sudden, the result is ‘[]’, an empty list. I tried it on another website too, with the same empty result.
I don’t understand. Both versions look the same. Why does one work and the other not?
2 Answers
Your selector is too narrow. Try:
Your current second selector doesn’t work on the page for me. If you want to get the links to the articles on the left-hand side, you need to change your CSS selector. For example, to the faster and more accurate
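As a rough illustration of why a long chain of child combinators (>) can come back empty while a shorter descendant selector still matches, here is a minimal sketch. The HTML in it is made up for the example: it assumes cMain and mArticle are ids rather than classes and that the links sit a few levels below .cont_thumb, which may not match the real page. extract_links is just a hypothetical helper, not something from either library.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page may be structured differently.
SAMPLE_HTML = """
<div id="kakaoContent">
  <div id="cMain">
    <div id="mArticle">
      <div class="box_etc">
        <ul class="list_news2">
          <li>
            <a class="link_thumb" href="https://example.com/a1"><img alt=""></a>
            <div class="cont_thumb">
              <strong class="tit_thumb"><a class="link_txt" href="https://example.com/a1">First</a></strong>
            </div>
          </li>
          <li>
            <div class="cont_thumb">
              <strong class="tit_thumb"><a class="link_txt" href="https://example.com/a2">Second</a></strong>
            </div>
          </li>
        </ul>
      </div>
    </div>
  </div>
</div>
"""

# The chain from the question: every ">" demands a direct parent/child pair,
# and ".cMain" / ".mArticle" only match elements that carry those names as classes.
STRICT = ".box_etc > .cMain > .mArticle > .box_etc > .list_news2 > .cont_thumb > a"

# A looser selector: a space means "any descendant", so extra wrapper
# elements (li, strong, ...) in between do not break the match.
LOOSE = ".list_news2 .cont_thumb a"


def extract_links(html, selector):
    """Return the href of every element matched by the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select(selector) if a.has_attr("href")]


strict_links = extract_links(SAMPLE_HTML, STRICT)
loose_links = extract_links(SAMPLE_HTML, LOOSE)
print(strict_links)  # [] because no element has class "cMain"
print(loose_links)   # ['https://example.com/a1', 'https://example.com/a2']
```

On the real page it is worth checking the markup in the browser's developer tools first. Also note that requests does not run JavaScript, so if the list is only filled in by a script, no selector will find it in req.text.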