
I’ve been using pandas’ json_normalize for a while but ran into a problem with a specific JSON file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109

I’m trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one seen in the ID: 101 entry) as NaN values in the dataframe. Passing errors='ignore' to the json_normalize call doesn’t prevent the TypeError that stems from trying to iterate over a null value.

Any advice or methods for getting a valid dataframe out of data with this structure would be greatly appreciated!

import json
import pandas as pd

data = """[
    {
        "ID": "100",
        "Ats": {
            "Ats": [
                {
                    "Name": "At1",
                    "Desc": "Lazy At"
                }
            ]
        }
    },
    {
        "ID": "101",
        "Ats": null
    }
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()
TypeError: 'NoneType' object is not iterable

I tried to iterate through the Ats dictionary, which works for the entry with ID 100 but not for the one with ID 101. I expected errors='ignore' to give me NaN values in the dataframe, but instead I received a TypeError for trying to iterate over a null value.

The desired output would be a dataframe with one row per ID, where the entry with the null Ats (ID 101) shows NaN in the Name and Desc columns.

2 Answers


  1. Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it to the requested form afterwards:

    import json
    
    import pandas as pd
    
    data = """
    [
        {
            "ID": "100",
            "Ats": {
                "Ats": [
                    {
                        "Name": "At1",
                        "Desc": "Lazy At"
                    }
                ]
            }
        },
        {
            "ID": "101",
            "Ats": null
        }
    ]"""
    
    data = json.loads(data)
    
    df = pd.DataFrame(data)

    # pull the nested list out of each "Ats" dict; the null entry becomes NaN
    df["Ats"] = df["Ats"].str["Ats"]
    # one row per list element; the NaN for ID 101 stays as a single row
    df = df.explode("Ats")

    # expand the remaining dicts into separate "Name"/"Desc" columns
    df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
    print(df)
    

    Prints:

        ID Name     Desc
    0  100  At1  Lazy At
    1  101  NaN      NaN
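
    The step that makes this work for the null entry is df["Ats"].str["Ats"]: the .str element accessor maps missing values straight to NaN, so the None for ID 101 survives the explode as a single NaN row. A quick (hypothetical) check of just that step:

    pd.Series([{"Ats": [1, 2]}, None]).str["Ats"]
    # 0    [1, 2]
    # 1       NaN
    # dtype: object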
    
  2. This approach builds the list of records first and creates the DataFrame in one go, which can be more efficient when dealing with large datasets.

    import numpy as np

    data = json.loads(data)
    # normalize each record individually; fall back to a NaN row when "Ats" is null
    desired_data = list(
        map(lambda x: pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
            if x["Ats"] is not None
            else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan}, data))

    df = pd.DataFrame(desired_data)
    

    Output:

      Name     Desc   ID
    0  At1  Lazy At  100
    1  NaN      NaN  101
    

    You might want to consider this simple try/except approach when working with small datasets: whenever a TypeError is raised, it appends a new row with NaN values to the DataFrame.

    Example:

    data = json.loads(data)
    df = pd.DataFrame()
    for item in data:
        try:
            # normal case: flatten the nested "Ats" records and attach the "ID" meta field
            df = df.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
        except TypeError:
            # "Ats" is null -> append a placeholder row of NaN values instead
            df = df.append({"ID": item["ID"], "Name": np.nan, "Desc": np.nan}, ignore_index=True)

    print(df)
    

    Output:

      Name     Desc   ID
    0  At1  Lazy At  100
    1  NaN      NaN  101
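
    Note that DataFrame.append was deprecated and then removed in pandas 2.0, so on newer versions the same idea can be written by collecting the per-record frames in a list and calling pd.concat once at the end. A minimal sketch of that variant, assuming data is the parsed list from the question:

    import numpy as np
    import pandas as pd

    frames = []
    for item in data:
        try:
            # normal case: flatten the nested "Ats" records and attach the "ID" meta field
            frames.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
        except TypeError:
            # "Ats" is null -> build a one-row frame of NaN placeholders
            frames.append(pd.DataFrame([{"ID": item["ID"], "Name": np.nan, "Desc": np.nan}]))

    df = pd.concat(frames, ignore_index=True)
    print(df)

    This should produce the same two-row result as above.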
    