
I'm scraping job content from a website (https://www.104.com.tw/job/?jobno=66wee). When I send the request, only part of the content in the 'p' element is returned. I want the whole div class="content" part.

My code:

  import requests
  from bs4 import BeautifulSoup

  payload = {'jobno':'66wee'}
  headers = {'user-agent': 'Mozilla/5.0 (Macintosh Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}
  r = requests.get('https://www.104.com.tw/job/',params = payload,headers = headers)
  soup = BeautifulSoup(r.text, 'html.parser')
  contents = soup.findAll('div', {'class': 'content'})
  description = contents[0].findAll('p')[0].text.strip()
  print(description)

Result (the job description part is missing):

4. Develop tools and systems that optimize analysis process efficiency and report quality.ion tools.row and succeed in a cross screen era. Appier is formed by a passionate team of computer scientists and engineers with experience in AI, data analysis, distributed systems, and marketing. Our colleagues come from Google, Intel, Yahoo, as well as renowned AI research groups in Harvard University and Stanford University. Headquartered in Taiwan, Appier serves more than 500 global brands and agencies from offices in international markets including Singapore, Japan, Australia, Hong Kong, Vietnam, India, Indonesia and South Korea.

But the HTML of this part is:

    <div class="content">
      <p>Appier is a technology company that makes it easy for businesses to use artificial intelligence to grow and succeed in a cross screen era. Appier is formed by a passionate team of computer scientists and engineers with experience in AI, data analysis, distributed systems, and marketing. Our colleagues come from Google, Intel, Yahoo, as well as renowned AI research groups in Harvard University and Stanford University. Headquartered in Taiwan, Appier serves more than 500 global brands and agencies from offices in international markets including Singapore, Japan, Australia, Hong Kong, Vietnam, India, Indonesia and South Korea.
<br>
<br>Job Description
<br>1. Perform data analysis to help Appier teams to answer business or operational questions.
<br>2. Interpret trends or patterns from complex data sets by using statistical and visualization tools.
<br>3. Conduct data analysis reports to illustrate the results and insight
<br>4. Develop tools and systems that optimize analysis process efficiency and report quality.</p>

3 Answers


  1. You are accessing only the first p element with the second [0] index:

    description = contents[0].findAll('p')[0].text.strip()
    

    You should iterate through all the p elements:

    description = ""
    for p in contents[0].findAll('p'):
        description += p.text.strip()
    
    print(description)
    
  2. Print the text of the first p element line by line:

    import requests
    from bs4 import BeautifulSoup
    
    payload = {'jobno': '66wee'}
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}
    r = requests.get('https://www.104.com.tw/job/',
                     params=payload, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    contents = soup.findAll('div', {'class': 'content'})
    # splitlines() breaks on \n, \r and \r\n alike, so every line of the text is printed separately
    for content in contents[0].findAll('p')[0].text.splitlines():
        print(content)
    
  3. The first element with class content contains more than this, but assuming you only want the text up to the end of point 4, i.e. the first p tag inside it, you can use a descendant combinator: a class selector for the parent element and an element selector for the child. Remove the p from the selector if you really want everything.

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.104.com.tw/job/?jobno=66wee'
    res = requests.get(url)
    soup = BeautifulSoup(res.content, "lxml")
    s = soup.select_one('.content p').text
    print(s)
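
A possible explanation for the garbled output in the question, offered as an assumption and not checked against the live page: the printed result looks like later lines overwriting earlier ones, which is consistent with a terminal handling bare carriage returns (\r) in the string. The sketch below uses a made-up sample string (not the real page text) and a hypothetical helper, clean_lines, to show how splitting on every kind of line break avoids that.

```python
# Made-up stand-in for the text of the first <p>; assumes the lines
# are separated by bare carriage returns, which is NOT confirmed.
raw = "Company intro.\r\rJob Description\r1. Perform data analysis.\r2. Interpret trends."


def clean_lines(text):
    """Split on any line boundary (\\r, \\n, \\r\\n) and drop blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


lines = clean_lines(raw)

# Re-join with plain newlines so a terminal shows one item per line
print("\n".join(lines))
```

In the scraper, the same helper could be applied to the text returned by soup.select_one('.content p').text before printing.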
    