Normalizing json using pandas with inconsistent nested lists/dictionaries

Payback
December 29, 2022
2 Answers

I’ve been using pandas’ json_normalize for a bit but ran into a problem with a specific JSON file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109

I’m trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one seen in the ID:101 entry) as NaN values in the dataframe. Passing errors='ignore' to json_normalize doesn’t prevent the TypeError that stems from trying to iterate through a null value.

Any advice on, or methods for, getting a valid dataframe out of data with this structure would be greatly appreciated!

import json
import pandas as pd

data = """[
    {
        "ID": "100",
        "Ats": {
            "Ats": [
                {
                    "Name": "At1",
                    "Desc": "Lazy At"
                }
            ]
        }
    },
    {
        "ID": "101",
        "Ats": null
    }
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()

TypeError: 'NoneType' object is not iterable

I tried to iterate through the Ats dictionary, which works for the entry with ID 100 but not for the one with ID 101. I expected errors='ignore' to produce NaN values in the dataframe for that entry, but instead received a TypeError for trying to iterate through a null value.

The desired output would look like this: Dataframe

Answers

Maybe you can create a DataFrame from the data directly (without pd.json_normalize) and then transform it into the requested form afterwards:

import json

import pandas as pd

data = """
[
    {
        "ID": "100",
        "Ats": {
            "Ats": [
                {
                    "Name": "At1",
                    "Desc": "Lazy At"
                }
            ]
        }
    },
    {
        "ID": "101",
        "Ats": null
    }
]"""

data = json.loads(data)

df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]
df = df.explode("Ats")

df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
print(df)

Prints:

    ID Name     Desc
0  100  At1  Lazy At
1  101  NaN      NaN
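
A related option, sketched below, is to patch the null entries before calling pd.json_normalize so that the record path exists for every entry. The helper name fill_null_ats and the placeholder {"Ats": [{}]} are assumptions for illustration and do not come from the answer above; the sketch relies on pandas turning an empty record into a row of NaN values.

```python
import json

import pandas as pd

raw = """[
    {"ID": "100", "Ats": {"Ats": [{"Name": "At1", "Desc": "Lazy At"}]}},
    {"ID": "101", "Ats": null}
]"""


def fill_null_ats(entries):
    """Return copies of the entries with a null "Ats" replaced by a placeholder.

    Hypothetical helper: the placeholder holds one empty record, so the
    record path ["Ats", "Ats"] can be walked for every entry.
    """
    patched = []
    for entry in entries:
        entry = dict(entry)  # shallow copy, leave the input untouched
        if entry.get("Ats") is None:
            entry["Ats"] = {"Ats": [{}]}
        patched.append(entry)
    return patched


entries = fill_null_ats(json.loads(raw))
df = pd.json_normalize(entries, record_path=["Ats", "Ats"], meta="ID")
print(df)
```

This only covers entries where the outer "Ats" is null or missing. An entry whose inner list is empty would still produce no row and would need its own handling.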

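Alternatively, normalize each entry on its own and fill in a row of NaN values where Ats is null. Whether this is more efficient on large datasets is worth measuring, since pd.json_normalize is called once per entry.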

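import json
import numpy as np
import pandas as pd

data = json.loads(data)  # `data` is the JSON string from the question
desired_data = [
    # Normalize each non-null entry on its own and keep its first record;
    # null entries get a hand-built row of NaN values.
    pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
    if x["Ats"] is not None
    else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan}
    for x in data
]
df = pd.DataFrame(desired_data)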

Output:

  Name     Desc   ID
0  At1  Lazy At  100
1  NaN      NaN  101
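
If an entry can hold several items under Ats -> Ats, one way to keep all of them is to build the rows in plain Python and construct the DataFrame once at the end. This is only a sketch under assumptions: the helper name iter_rows and the second sample item (At2) are made up for illustration and are not part of the question's data.

```python
import json

import pandas as pd

raw = """[
    {"ID": "100", "Ats": {"Ats": [
        {"Name": "At1", "Desc": "Lazy At"},
        {"Name": "At2", "Desc": "Busy At"}
    ]}},
    {"ID": "101", "Ats": null}
]"""


def iter_rows(entries):
    """Yield one flat dict per nested item, or a bare ID row when there are none."""
    for entry in entries:
        outer = entry.get("Ats") or {}   # null or missing -> empty dict
        items = outer.get("Ats") or []   # null or missing -> empty list
        if not items:
            yield {"ID": entry["ID"]}    # other columns are filled with NaN below
        for item in items:
            yield {"ID": entry["ID"], **item}


# Given a list of dicts, pandas fills keys missing from a row with NaN.
df = pd.DataFrame(list(iter_rows(json.loads(raw))), columns=["ID", "Name", "Desc"])
print(df)
```

Because the rows are plain dicts, the same loop can be extended to other nested keys without touching the pandas call.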

For small datasets, a simple try/except loop also works. Whenever normalizing an entry raises a TypeError, a row with NaN values is added for that ID instead.

Example:

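import json
import numpy as np
import pandas as pd

data = json.loads(data)  # `data` is the JSON string from the question
frames = []
for item in data:
    try:
        frames.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
    except TypeError:
        # "Ats" is null: add a one-row frame of NaN values for this ID
        frames.append(pd.DataFrame([{"ID": item["ID"], "Name": np.nan, "Desc": np.nan}]))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df = pd.concat(frames, ignore_index=True)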

print(df)

Output:

  Name     Desc   ID
0  At1  Lazy At  100
1  NaN      NaN  101
