skip to Main Content

I have JSON data as follows:

json_data = [
    {
        "meta_info": {"id": 1, "name": "Alice"},
        "data": [{"value": 100}, {"value": 200}]
    },
    {
        "meta_info": {"id": 2, "name": "Bob"},
        "data": []
    },
    {
        "meta_info": {"id": 3, "name": "Charlie"}
        # "data" key is missing
    }
]

When I try to read it into a pd.DataFrame with pd.json_normalize:

df = pd.json_normalize(json_data, record_path='data', meta=['meta_info.id', 'meta_info.name'], errors='ignore') 

I am getting the error:

KeyError: "Key 'data' not found. If specifying a record_path, all elements of data should have the path.

Based on json_data input above, my desired output as a dataframe looks like this:

value meta_info.id meta_info.name
100.0 1 Alice
200.0 1 Alice
NaN 2 Bob
NaN 3 Charlie

2

Answers


  1. For this, you should iterate through the data in python (json_normalize is not flexible enough to cover this):

    contents = []
    
    for dictionary in json_data:
        meta_info = dictionary['meta_info']
        values = dictionary.get('data')
        if not values:
            values = [{'value':None}]
        values = [entry|meta_info for entry in values]
        contents.extend(values)
    
    pd.DataFrame(contents)
       value  id     name
    0  100.0   1    Alice
    1  200.0   1    Alice
    2    NaN   2      Bob
    3    NaN   3  Charlie
    
    Login or Signup to reply.
  2. One approach could be to correct json_data before passing it to pd.json_normalize. Showing 2 options depending on python version (>= 3.5, >= 3.9):

    Python 3.5

    Since python 3.5 (examples at PEP 448), we can update via dictionary unpacking (**):

    data = [
            {**d,
             **{'data': [{'value': np.nan}] 
                if d.get('data', []) == []
                else d.get('data')
                }
             } 
            for d in json_data
            ]
    
    out = pd.json_normalize(data=data, 
                            record_path='data', 
                            meta=[['meta_info', 'id'], ['meta_info', 'name']]
                            )
    

    Output:

       value meta_info.id meta_info.name
    0  100.0            1          Alice
    1  200.0            1          Alice
    2    NaN            2            Bob
    3    NaN            3        Charlie
    

    Explanation

    The value for our "update" dict (i.e., **{'data': ..} we determine as follows:

    • For each dict (d), we try dict.get and default to [], thus aligning dicts without ‘data’ key (your 3rd dict) with the dicts that have {'data': []} (your 2nd dict).
    • Now we check if the value equals [], if so, we overwrite with [{'value': np.nan}] (now applicable to both 2nd and 3rd dict), else we just keep the value as is (1st dict).
    • The logic of {**d, **{'data': ...}} is that the later values (i.e. the right dict) will update the left values (the left dict). Doing this inside a for loop, for each d in json_data, we get the intermediate result:
    # data
    
    [{'meta_info': {'id': 1, 'name': 'Alice'},
      'data': [{'value': 100}, {'value': 200}]},
     {'meta_info': {'id': 2, 'name': 'Bob'}, 'data': [{'value': nan}]},
     {'meta_info': {'id': 3, 'name': 'Charlie'}, 'data': [{'value': nan}]}]
    
    • This we can correctly pass to pd.json_normalize. Make sure to add meta fields correctly. Not: meta=['meta_info.id', ...], but: meta=[['meta_info','id'], ...].

    Python 3.9

    Since python 3.9 (examples at PEP 584) we can create data using the dict merge operator (|):

    data = [d 
            if d.get('data', []) != [] 
            else d | {'data': [{'value': np.nan}]} 
            for d in json_data
            ]
    
    # pd.json_normalize(data, ...) -> same result
    
    • In this case, we keep d if d.get('data', []) != [] (1st dict), else we update d with {'data': [{'value': np.nan}]} via d | {'data': [{'value': np.nan}]} (2nd and 3rd dict).
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search