I am trying to get internet penetration data from the World Bank, and while parsing it for further processing I get the error below.
Here is the code:
import pandas as pd
import requests
import csv
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import json
url = 'https://api.worldbank.org/v2/country/all/indicator/IT.NET.USER.ZS'
params = {
    'format': 'json',
    'date': '1990:2022',
    'per_page': '10000'  # Maximum number of results per page
}
r = requests.get(url, params=params)
data = r.json()[1] # Index 1 contains the actual data
data_json = json.dumps(data)
# Parse the API response using BeautifulSoup
soup = BeautifulSoup(data_json, 'html.parser')
# Extract relevant data from the parsed response
parsed_data = []
for entry in soup.find_all('record'):
    country_iso = entry.find('field', {'name': 'countryiso3code'}).get_text()
    country_name = entry.find('field', {'name': 'country'}).get_text()
    value = entry.find('field', {'name': 'value'}).get_text()
    for date_entry in entry.find_all('data'):
        date = date_entry.get('date')
        parsed_data.append({
            'countryiso3code': country_iso,
            'country': country_name,
            'date': date,
            'value': value
        })
# Create a DataFrame from the parsed data
df = pd.DataFrame(parsed_data)
df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
df = df[df['date'].astype(int) >= 1990]
Error:
KeyError Traceback (most recent call last)
Cell In[15], line 47
44 df = pd.DataFrame(parsed_data)
46 # Add the 'date' column to the DataFrame
---> 47 df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
49 # Filter data for the past 21 years as it's the first available data input to the World Bank
50 df = df[df['date'].astype(int) >= 1990]
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py:3761, in DataFrame.__getitem__(self, key)
3759 if self.columns.nlevels > 1:
3760 return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
3762 if is_integer(indexer):
3763 indexer = [indexer]
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\range.py:349, in RangeIndex.get_loc(self, key)
347 raise KeyError(key) from err
348 if isinstance(key, Hashable):
--> 349 raise KeyError(key)
350 self._check_indexing_error(key)
351 raise KeyError(key)
KeyError: 'date'
I am a beginner with this whole web scraping stuff. Can someone help me out?
I tried changing some of the parsing code for the date, but no luck.
2 Answers
This step doesn’t make sense: you have the data in JSON format, then convert it back into a string with json.dumps(data), then parse that string as HTML with BeautifulSoup(data_json, 'html.parser'). But JSON is not HTML, so Beautiful Soup can’t parse it in a meaningful way.
When soup.find_all('record') runs, it finds no records, and therefore the loop runs 0 times. Instead, I would suggest something like this:
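For example, one way to do it (a minimal sketch, assuming the usual World Bank v2 JSON response shape, with data being the list from r.json()[1]):

# Sketch: turn the list of JSON records into a DataFrame directly.
# json_normalize flattens nested fields such as country into
# 'country.id' / 'country.value' columns.
df = pd.json_normalize(data)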
This converts JSON to a dataframe.
This step, df = df[df['date'].astype(int) >= 1990], isn’t doing what you expect. It converts the date to an int, which is the number of nanoseconds since 1970, so the code is checking whether the date is later than about 0.002 ms after Jan 1, 1970.
You probably want to check the year, instead:
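For example, using the .dt accessor on the column after the to_datetime conversion:

# Compare the calendar year rather than the raw nanosecond value.
df = df[df['date'].dt.year >= 1990]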
The problem is that you’re getting JSON data ('format': 'json'), parsing it to a Python object (r.json()), converting it back to a JSON string (json.dumps(data)), then trying to parse it as if it were HTML (BeautifulSoup(data_json, 'html.parser')). The result is that parsed_data is always empty. Skip the middlemen and operate on the parsed object directly:
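Something along these lines (a sketch, assuming the usual World Bank v2 response shape, where each record carries countryiso3code, a nested country object, date and value):

import pandas as pd
import requests

url = 'https://api.worldbank.org/v2/country/all/indicator/IT.NET.USER.ZS'
params = {
    'format': 'json',
    'date': '1990:2022',
    'per_page': '10000'
}

r = requests.get(url, params=params)
data = r.json()[1]  # index 1 holds the list of records

# Build the same shape you expected, reading the fields straight from
# each JSON record (the country name is assumed to sit under
# entry['country']['value'] in the World Bank v2 response).
parsed_data = []
for entry in data:
    parsed_data.append({
        'countryiso3code': entry['countryiso3code'],
        'country': entry['country']['value'],
        'date': entry['date'],
        'value': entry['value'],
    })

df = pd.DataFrame(parsed_data)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df = df[df['date'].dt.year >= 1990]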
Here I’ve created a parsed_data array in the same shape that you expected, but you could simply use data directly and change the names of the fields you use.