I am trying to get internet penetration data from the World Bank, and while parsing it for further processing I get the error below.
Here is the code:
import pandas as pd
import requests
import csv
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import json
url = 'https://api.worldbank.org/v2/country/all/indicator/IT.NET.USER.ZS'
params = {
    'format': 'json',
    'date': '1990:2022',
    'per_page': '10000'  # Maximum number of results per page
}
r = requests.get(url, params=params)
data = r.json()[1] # Index 1 contains the actual data
data_json = json.dumps(data)
# Parse the API response using BeautifulSoup
soup = BeautifulSoup(data_json, 'html.parser')
# Extract relevant data from the parsed response
parsed_data = []
for entry in soup.find_all('record'):
    country_iso = entry.find('field', {'name': 'countryiso3code'}).get_text()
    country_name = entry.find('field', {'name': 'country'}).get_text()
    value = entry.find('field', {'name': 'value'}).get_text()
    for date_entry in entry.find_all('data'):
        date = date_entry.get('date')
        parsed_data.append({
            'countryiso3code': country_iso,
            'country': country_name,
            'date': date,
            'value': value
        })
# Create a DataFrame from the parsed data
df = pd.DataFrame(parsed_data)
df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
df = df[df['date'].astype(int) >= 1990]
Error:
KeyError Traceback (most recent call last)
Cell In[15], line 47
44 df = pd.DataFrame(parsed_data)
46 # Add the 'date' column to the DataFrame
---> 47 df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True)
49 # Filter data for the past 21 years as it's the first available data input to the World Bank
50 df = df[df['date'].astype(int) >= 1990]
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py:3761, in DataFrame.__getitem__(self, key)
3759 if self.columns.nlevels > 1:
3760 return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
3762 if is_integer(indexer):
3763 indexer = [indexer]
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\range.py:349, in RangeIndex.get_loc(self, key)
347 raise KeyError(key) from err
348 if isinstance(key, Hashable):
--> 349 raise KeyError(key)
350 self._check_indexing_error(key)
351 raise KeyError(key)
KeyError: 'date'
I am a beginner with this whole web scraping stuff. Can someone help me out?
I tried changing some of the parsing code for the date, but no luck.
2 Answers
This step doesn’t make sense: you have the data in JSON format, then convert it back into a string with json.dumps(data), then parse that string as HTML with BeautifulSoup(data_json, 'html.parser'). But JSON is not HTML, so Beautiful Soup can’t parse it in a meaningful way.
When soup.find_all('record') runs, it finds no records, and therefore the loop runs 0 times. Instead, I would suggest something like this:
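For example, one way to do it (a minimal sketch, assuming the usual World Bank v2 JSON response shape, with data being the list from r.json()[1]):

# Sketch: turn the list of JSON records into a DataFrame directly.
# json_normalize flattens nested fields such as country into
# 'country.id' / 'country.value' columns.
df = pd.json_normalize(data)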
This converts JSON to a dataframe.
This step, df = df[df['date'].astype(int) >= 1990], isn’t doing what you expect. It converts the date to an int, which is the number of nanoseconds since 1970, so the code is checking whether the date is later than about 0.002 ms after Jan 1, 1970.
You probably want to check the year, instead:
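For example, using the .dt accessor on the column after the to_datetime conversion:

# Compare the calendar year rather than the raw nanosecond value.
df = df[df['date'].dt.year >= 1990]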
The problem is that you’re getting JSON data ('format': 'json'), parsing it to a Python object (r.json()), converting it back to a JSON string (json.dumps(data)), then trying to parse it as if it were HTML (BeautifulSoup(data_json, 'html.parser')). The result is that parsed_data is always empty. Skip the middlemen and operate on the parsed object directly:
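Something along these lines (a sketch, assuming the usual World Bank v2 response shape, where each record carries countryiso3code, a nested country object, date and value):

import pandas as pd
import requests

url = 'https://api.worldbank.org/v2/country/all/indicator/IT.NET.USER.ZS'
params = {
    'format': 'json',
    'date': '1990:2022',
    'per_page': '10000'
}

r = requests.get(url, params=params)
data = r.json()[1]  # index 1 holds the list of records

# Build the same shape you expected, reading the fields straight from
# each JSON record (the country name is assumed to sit under
# entry['country']['value'] in the World Bank v2 response).
parsed_data = []
for entry in data:
    parsed_data.append({
        'countryiso3code': entry['countryiso3code'],
        'country': entry['country']['value'],
        'date': entry['date'],
        'value': entry['value'],
    })

df = pd.DataFrame(parsed_data)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df = df[df['date'].dt.year >= 1990]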
Here I’ve created a parsed_data array in the same shape that you expected, but you could simply use data directly and change the names of the fields you use.