
I am trying to get internet penetration data from the World Bank API, and while parsing it for further processing I get the error below.
Here is the code:


import pandas as pd
import requests
import csv
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import json

url = 'https://api.worldbank.org/v2/country/all/indicator/IT.NET.USER.ZS'
params = {
    'format': 'json',
    'date': '1990:2022',
    'per_page': '10000'  # Maximum number of results per page
}

r = requests.get(url, params=params)
data = r.json()[1]  # Index 1 contains the actual data
data_json = json.dumps(data)

# Parse the API response using BeautifulSoup
soup = BeautifulSoup(data_json, 'html.parser')

# Extract relevant data from the parsed response
parsed_data = []
for entry in soup.find_all('record'):
    country_iso = entry.find('field', {'name': 'countryiso3code'}).get_text()
    country_name = entry.find('field', {'name': 'country'}).get_text()
    value = entry.find('field', {'name': 'value'}).get_text()

    for date_entry in entry.find_all('data'):
        date = date_entry.get('date')

        parsed_data.append({
            'countryiso3code': country_iso,
            'country': country_name,
            'date': date,
            'value': value
        })

# Create a DataFrame from the parsed data
df = pd.DataFrame(parsed_data)

df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)

df = df[df['date'].astype(int) >= 1990]

Error:

KeyError                                  Traceback (most recent call last)
Cell In[15], line 47
     44 df = pd.DataFrame(parsed_data)
     46 # Add the 'date' column to the DataFrame
---> 47 df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
     49 # Filter data for the past 21 years as it's the first available data input to the World Bank
     50 df = df[df['date'].astype(int) >= 1990]

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\range.py:349, in RangeIndex.get_loc(self, key)
    347         raise KeyError(key) from err
    348 if isinstance(key, Hashable):
--> 349     raise KeyError(key)
    350 self._check_indexing_error(key)
    351 raise KeyError(key)

KeyError: 'date'

I am a beginner with this whole web scraping stuff. Can someone help me out?

I tried changing some of the parsing code for the date, but no luck.

2 Answers


  1. data_json = json.dumps(data)
    
    # Parse the API response using BeautifulSoup
    soup = BeautifulSoup(data_json, 'html.parser')
    

    This step doesn't make sense: you already have the data as a parsed JSON object, then you convert it back into a string and parse that string as HTML. But JSON is not HTML, so Beautiful Soup can't parse it in any meaningful way.

    When the code soup.find_all('record') runs, it finds no records, and therefore the loop runs 0 times.
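
    You can verify this with a tiny experiment (the JSON string below is just a stand-in for the real API response):

    from bs4 import BeautifulSoup

    # A JSON string contains no HTML tags, so the parse tree has no elements to find
    soup = BeautifulSoup('[{"countryiso3code": "USA", "date": "2022"}]', 'html.parser')
    print(soup.find_all('record'))  # prints []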

    Instead, I would suggest something like this:

    r = requests.get(url, params=params)
    data = r.json()[1]  # Index 1 contains the actual data
    df = pd.json_normalize(data)
    df['date'] = pd.to_datetime(df['date'], errors='coerce')  # infer_datetime_format is deprecated in pandas >= 2.0
    

    This converts the JSON records directly into a DataFrame.
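
    One caveat: json_normalize flattens nested objects into dotted column names, and the World Bank records nest the country name inside an object. A quick check (the exact column list is an assumption about the usual response layout):

    df = pd.json_normalize(data)
    # Nested fields come out as dotted columns, e.g. the country name lands in 'country.value'
    print(df.columns.tolist())
    # Expect something like:
    # ['countryiso3code', 'date', 'value', ..., 'indicator.id', 'indicator.value', 'country.id', 'country.value']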


    df = df[df['date'].astype(int) >= 1990]
    

    This step isn't doing what you expect. Converting the datetime column with .astype(int) yields the number of nanoseconds since Jan 1, 1970, so the comparison only checks whether each date falls more than 1990 nanoseconds (about 0.002 ms) after Jan 1, 1970.

    You probably want to check the year, instead:

    df = df[df['date'].dt.year >= 1990]
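
    Alternatively, since the dates in this response should be plain year strings like "2022" (an assumption based on the 'date': '1990:2022' parameter), you can parse them with an explicit format and skip the guessing entirely:

    # Parse four-digit year strings directly; anything unparseable becomes NaT
    df['date'] = pd.to_datetime(df['date'], format='%Y', errors='coerce')
    df = df[df['date'].dt.year >= 1990]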
    
  2. The problem is that you're requesting JSON data ('format': 'json'), parsing it into a Python object (r.json()), converting it back into a JSON string (json.dumps(data)), and then trying to parse that string as if it were HTML (BeautifulSoup(data_json, 'html.parser')).

    The result is that parsed_data is always empty.

    Skip the middlemen and operate on the parsed object directly:

    r = requests.get(url, params=params)
    data = r.json()[1]  # Index 1 contains the actual data
    
    parsed_data = []
    for entry in data:
        parsed_data.append({
            'countryiso3code': entry['countryiso3code'],
            'country': entry['country']['value'],
            'date': entry['date'],
            'value': entry['value']
        })
    

    Here I've created a parsed_data list in the same shape you expected, but you could also work with data directly and just adjust which field names you reference.
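
    From there, the rest of your original workflow falls into place (a short sketch; format='%Y' assumes the dates are plain year strings):

    df = pd.DataFrame(parsed_data)
    # Year strings parse cleanly with an explicit format; bad values become NaT
    df['date'] = pd.to_datetime(df['date'], format='%Y', errors='coerce')
    df = df[df['date'].dt.year >= 1990]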
