Question posted in Json
Our archive of expertly curated questions and answers provides insights and solutions to common problems related to this popular data interchange format. From parsing and manipulating JSON data to integrating it with various programming languages and web services, our archive has got you covered. Start exploring today and take your JSON skills to the next level

When using Pandas json_normalize function, in the JSON i passed for some of the sections "record_path" not exists

SreekarR
May 22, 2024
120 views
0 votes
2 Answers

I have JSON data as follows:

json_data = [
    {
        "meta_info": {"id": 1, "name": "Alice"},
        "data": [{"value": 100}, {"value": 200}]
    },
    {
        "meta_info": {"id": 2, "name": "Bob"},
        "data": []
    },
    {
        "meta_info": {"id": 3, "name": "Charlie"}
        # "data" key is missing
    }
]

When I try to read it into a pd.DataFrame with pd.json_normalize:

df = pd.json_normalize(json_data, record_path='data', meta=['meta_info.id', 'meta_info.name'], errors='ignore')

I am getting the error:

KeyError: "Key 'data' not found. If specifying a record_path, all elements of data should have the path.

Based on json_data input above, my desired output as a dataframe looks like this:

value	meta_info.id	meta_info.name
100.0	1	Alice
200.0	1	Alice
NaN	2	Bob
NaN	3	Charlie

Answers

For this, you should iterate through the data in python (json_normalize is not flexible enough to cover this):

contents = []

for dictionary in json_data:
    meta_info = dictionary['meta_info']
    values = dictionary.get('data')
    if not values:
        values = [{'value':None}]
    values = [entry|meta_info for entry in values]
    contents.extend(values)

pd.DataFrame(contents)
   value  id     name
0  100.0   1    Alice
1  200.0   1    Alice
2    NaN   2      Bob
3    NaN   3  Charlie

- ouroboros1
- May 22, 2024 at 12:09 pm
- 0 votes
0
One approach could be to correct json_data before passing it to pd.json_normalize. Showing 2 options depending on python version (>= 3.5, >= 3.9):

Python 3.5

Since python 3.5 (examples at PEP 448), we can update via dictionary unpacking (**):
```
data = [
        {**d,
         **{'data': [{'value': np.nan}] 
            if d.get('data', []) == []
            else d.get('data')
            }
         } 
        for d in json_data
        ]

out = pd.json_normalize(data=data, 
                        record_path='data', 
                        meta=[['meta_info', 'id'], ['meta_info', 'name']]
                        )
```
Output:
```
   value meta_info.id meta_info.name
0  100.0            1          Alice
1  200.0            1          Alice
2    NaN            2            Bob
3    NaN            3        Charlie
```
Explanation

The value for our "update" dict (i.e., **{'data': ..} we determine as follows:
- For each dict (d), we try dict.get and default to [], thus aligning dicts without ‘data’ key (your 3rd dict) with the dicts that have {'data': []} (your 2nd dict).
- Now we check if the value equals [], if so, we overwrite with [{'value': np.nan}] (now applicable to both 2nd and 3rd dict), else we just keep the value as is (1st dict).
- The logic of {**d, **{'data': ...}} is that the later values (i.e. the right dict) will update the left values (the left dict). Doing this inside a for loop, for each d in json_data, we get the intermediate result:
```
# data

[{'meta_info': {'id': 1, 'name': 'Alice'},
  'data': [{'value': 100}, {'value': 200}]},
 {'meta_info': {'id': 2, 'name': 'Bob'}, 'data': [{'value': nan}]},
 {'meta_info': {'id': 3, 'name': 'Charlie'}, 'data': [{'value': nan}]}]
```
- This we can correctly pass to pd.json_normalize. Make sure to add meta fields correctly. Not: meta=['meta_info.id', ...], but: meta=[['meta_info','id'], ...].
Python 3.9

Since python 3.9 (examples at PEP 584) we can create data using the dict merge operator (|):
```
data = [d 
        if d.get('data', []) != [] 
        else d | {'data': [{'value': np.nan}]} 
        for d in json_data
        ]

# pd.json_normalize(data, ...) -> same result
```
- In this case, we keep d if d.get('data', []) != [] (1st dict), else we update d with {'data': [{'value': np.nan}]} via d | {'data': [{'value': np.nan}]} (2nd and 3rd dict).
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.