I’m trying to scrape all the information on an artist’s Billboard chart-history page about their singles and how they performed. I’m adapting a solution I’ve seen elsewhere. It works up to a point, but once I get past "peak pos" I don’t know how to include "peak date" and "wks" from the table. I want to capture everything as it appears in the table on the website and eventually put it in a DataFrame, but I can’t get the last two columns. Any pointers would be greatly appreciated. Thanks!

import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.billboard.com/artist/john-lennon/chart-history/hsi/')
soup = BeautifulSoup(url.content, 'html.parser')
result = soup.find_all('div','o-chart-results-list-row')

for res in result:
    song = res.find('h3').text.strip()
    artist = res.find('h3').find_next('span').text.strip()
    debute = res.find('span').find_next('span').text.strip()
    peak = res.find('a').find_next('span').text.strip()
    #peak_date = ?
    #wks = ?

    print("song: "+str(song))
    print("artist: "+ str(artist))
    print("debute: "+ str(debute))
    print("peak: "+ str(peak))
    print("___________________________________________________")

song: (Just Like) Starting Over
artist: John Lennon
debute: 11.01.80
peak: 1
peak_date:
wks:

3 Answers


  1. Try:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.billboard.com/artist/john-lennon/chart-history/hsi/"
    
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    
    data = []
    for row in soup.select(".o-chart-results-list-row"):
        title = row.h3.get_text(strip=True)
        artist = row.span.get_text(strip=True)
        debut_date = row.select_one(".artist-chart-row-debut-date").get_text(strip=True)
        peak_pos = row.select_one(".artist-chart-row-peak-pos").get_text(strip=True)
        peak_week = row.select_one(".artist-chart-row-peak-week").get_text(strip=True)
        peak_date = row.select_one(".artist-chart-row-peak-date").get_text(strip=True)
        wks_on_chart = row.select_one(".artist-chart-row-week-on-chart").get_text(
            strip=True
        )
        data.append(
            {
                "Title": title,
                "Artist": artist,
                "Debut Date": debut_date,
                "Peak Pos": peak_pos,
                "Peak Week": peak_week,
                "Weeks on Chart": wks_on_chart,
            }
        )
    
    
    df = pd.DataFrame(data)
    print(df)
    

    Prints:

                                   Title                                                            Artist Debut Date Peak Pos Peak Week Weeks on Chart
    0          (Just Like) Starting Over                                                       John Lennon   11.01.80        1     5 WKS             22
    1                              Woman                                                       John Lennon   01.17.81        2    12 Wks             20
    2                Watching The Wheels                                                       John Lennon   03.28.81       10    12 Wks             17
    3   Whatever Gets You Thru The Night                     John Lennon With The Plastic Ono Nuclear Band   09.28.74        1     1 WKS             15
    4                     Nobody Told Me                                                       John Lennon   01.21.84        5    12 Wks             14
    5    Instant Karma (We All Shine On)                                                   John Ono Lennon   02.28.70        3    12 Wks             13
    6                         MIND GAMES                                                       John Lennon   11.10.73       18    12 Wks             13
    7                           #9 Dream                                                       John Lennon   12.21.74        9    12 Wks             12
    8                        Cold Turkey                                                  Plastic Ono Band   11.15.69       30    12 Wks             12
    9                            Imagine                                      John Lennon/Plastic Ono Band   10.23.71        3    12 Wks              9
    10               Give Peace A Chance                                                  Plastic Ono Band   07.26.69       14    12 Wks              9
    11               Power To The People            John Lennon/Plastic Ono Band Yoko Ono/Plastic Ono Band   04.03.71       11    12 Wks              9
    12                       Stand By Me                                                       John Lennon   03.15.75       20    12 Wks              9
    13                            Mother            John Lennon/Plastic Ono Band Yoko Ono/Plastic Ono Band   01.09.71       43    12 Wks              6
    14          Happy Xmas (War Is Over)  John & Yoko/The Plastic Ono Band With The Harlem Community Choir   12.29.18       38    12 Wks              6
    15                  I'm Steppin' Out                                                       John Lennon   03.31.84       55    12 Wks              6
    16  Woman Is The Nigger Of The World               John Lennon/Plastic Ono Band With Elephant's Memory   05.20.72       57    12 Wks              5
    17                       Jealous Guy                                John Lennon & The Plastic Ono Band   10.15.88       80    12 Wks              4
    
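The loop in the answer above reads `peak_date` but leaves it out of the dictionary, so the printed table has no peak-date column. Below is a minimal sketch of one way to carry it through, run against a small hand-written HTML fragment rather than the live page. The fragment's structure and its values (`Example Song`, `02.03.04`, and so on) are placeholders; only the class names are taken from the answers in this thread. `COLUMNS` and `parse_rows` are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one chart row. The layout and values are
# assumptions for illustration, not the real page markup.
SAMPLE_HTML = """
<div class="o-chart-results-list-row">
  <h3>Example Song</h3>
  <span>Example Artist</span>
  <span class="artist-chart-row-debut-date"><a href="#">01.02.03</a></span>
  <span class="artist-chart-row-peak-pos">1</span>
  <span class="artist-chart-row-peak-date"><a href="#">02.03.04</a></span>
  <span class="artist-chart-row-week-on-chart">5</span>
</div>
"""

# Output column name -> CSS selector (class names as used in this thread).
COLUMNS = {
    "Debut Date": ".artist-chart-row-debut-date",
    "Peak Pos": ".artist-chart-row-peak-pos",
    "Peak Date": ".artist-chart-row-peak-date",
    "Weeks on Chart": ".artist-chart-row-week-on-chart",
}


def parse_rows(html):
    """Return one dict per chart row, including the peak date."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select(".o-chart-results-list-row"):
        record = {
            "Title": row.h3.get_text(strip=True),
            "Artist": row.span.get_text(strip=True),
        }
        for column, selector in COLUMNS.items():
            cell = row.select_one(selector)
            # Store None instead of failing when a cell is absent.
            record[column] = cell.get_text(strip=True) if cell else None
        rows.append(record)
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` should give one column per key. Storing `None` for a missing cell avoids the `AttributeError` raised when `select_one(...)` matches nothing and `.get_text()` is called on the result.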
  2. I would inspect the page source to see where each column is located and then take advantage of the class names. For peak_date, the value is in the next <a>; for the weeks, it is in the next <span> with "artist-chart-row-week-on-chart" as its class name.

    The complete code to get what you want is below:

    import requests
    from bs4 import BeautifulSoup
    
    url = requests.get('https://www.billboard.com/artist/john-lennon/chart-history/hsi/')
    soup = BeautifulSoup(url.content, 'html.parser')
    result = soup.find_all('div','o-chart-results-list-row')
    
    for res in result:
        song = res.find('h3').text.strip()
        artist = res.find('h3').find_next('span').text.strip()
        debute = res.find('span').find_next('span').text.strip()
        peak = res.find('a').find_next('span').text.strip()
        peak_date = res.find('a').find_next('a').text.strip()
        wks = res.find_next('span','artist-chart-row-week-on-chart').text.strip()
    
        print("song: "+str(song))
        print("artist: "+ str(artist))
        print("debute: "+ str(debute))
        print("peak: "+ str(peak))
        print("peak_date: "+ str(peak_date))
        print("wks: "+ str(wks))    
        print("___________________________________________________")
    
  3. There are generally several ways to access elements in an HTML document. One is chaining find/find_next as you did. This works and can be adapted to get the weeks and peak date you are looking for.

    peak_date = res.find("a").find_next("a").text.strip()
    wks = res.find("a").find_next("a").find_next("span").text.strip()
    

    However, a much better solution would be to look for the elements directly by their class name. This will allow your script to work even when the order of the elements is changed, as long as the class names stay the same. It may look like this:

    peak_date = res.find("span", class_="artist-chart-row-peak-date").text.strip()
    wks = res.find("span", class_="artist-chart-row-week-on-chart").text.strip()
    

    The complete code would then be:

    import requests
    from bs4 import BeautifulSoup
    
    url = requests.get('https://www.billboard.com/artist/john-lennon/chart-history/hsi/')
    soup = BeautifulSoup(url.content, 'html.parser')
    result = soup.find_all('div','o-chart-results-list-row')
    
    for res in result:
        song = res.find('h3').text.strip()
        artist = res.find('h3').find_next('span').text.strip()
        debute = res.find('span').find_next('span').text.strip()
        peak = res.find('a').find_next('span').text.strip()
    
        # Sloppy solution by chaining find_next
        # peak_date = res.find("a").find_next("a").text.strip()
        # wks = res.find("a").find_next("a").find_next("span").text.strip()
    
        # Better solution by searching for elements with class name
        peak_date = res.find("span", class_="artist-chart-row-peak-date").text.strip()
        wks = res.find("span", class_="artist-chart-row-week-on-chart").text.strip()
    
        print("song: "+str(song))
        print("artist: "+ str(artist))
        print("debute: "+ str(debute))
        print("peak: "+ str(peak))
        print("peak date: " + str(peak_date))
        print("weeks: " + str(wks))
        print("___________________________________________________")
    
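To illustrate the third answer's point about element order, here is a small sketch that runs both lookups on two hand-written fragments: one in the order the chained lookup expects, and one with the weeks cell moved ahead of the peak date. The markup and values are assumed simplifications, not the real Billboard page, and the function names are hypothetical.

```python
from bs4 import BeautifulSoup

# Assumed simplified row, in the order the chained lookup expects.
ORIGINAL = """
<div class="o-chart-results-list-row">
  <span class="artist-chart-row-debut-date"><a href="#">01.02.03</a></span>
  <span class="artist-chart-row-peak-pos">1</span>
  <span class="artist-chart-row-peak-date"><a href="#">02.03.04</a></span>
  <span class="artist-chart-row-week-on-chart">5</span>
</div>
"""

# Same cells, but the weeks cell now comes before the peak date.
REORDERED = """
<div class="o-chart-results-list-row">
  <span class="artist-chart-row-debut-date"><a href="#">01.02.03</a></span>
  <span class="artist-chart-row-week-on-chart">5</span>
  <span class="artist-chart-row-peak-date"><a href="#">02.03.04</a></span>
  <span class="artist-chart-row-peak-pos">1</span>
</div>
"""


def first_row(html):
    """Parse the fragment and return its single row element."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", class_="o-chart-results-list-row")


def weeks_by_position(row):
    # Chained lookup: second <a>, then whatever <span> follows it.
    return row.find("a").find_next("a").find_next("span").text.strip()


def weeks_by_class(row):
    # Class lookup: independent of where the cell sits in the row.
    return row.find("span", class_="artist-chart-row-week-on-chart").text.strip()


for label, html in (("original", ORIGINAL), ("reordered", REORDERED)):
    row = first_row(html)
    print(label, weeks_by_position(row), weeks_by_class(row))
```

On the reordered fragment the chained lookup lands on the peak-position cell and returns its value without raising any error, which is why the class-based lookup is the safer choice as long as the class names stay stable.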