Unable to explode the nested list within a json using pd.json_normalize

ArunK
December 20, 2023
164 views
0 votes
2 Answers

I tried different ways to get this done, but unable to do so. Please help here.
I am trying to explode a nested json list using python pd.json_normalize. After exploding the output is coming in two rows causing duplicate PKeys. My data is below and i am trying to get a single record with "id" as pkey and values in the nested list

{
"data": [
    {
        "id": "123456",
        "created_by": "[email protected]",
        "created_at": "2023-12-11",
        "modified_by": "[email protected]",
        "modified_at": "2023-12-11",
        "rg": {
            "id": "xyz234",
            "parent_id": "cpqrst",
            "account_id": "ntb1we",
 "values": [
            {
                "value1": 100,
                "value2": 101,
                "value10": {
                    "valueid": "btqns",
            },
            {
                "value1": 200,
                "value2": 201,
                "value10": {
                    "valueid": "apqns",
                
            }       
    }
],
"page_info": {
    "page": 1,
    "per_page": 100,
    "total_pages": 286,
    "total_count": 28580
}

}

my code as below to normalize and explode

response = requests.get(getURL, params={"page": page}, headers=headers)
    if response.status_code == 200:
        json_data = response.json()
        target_schema  = StructType([StructField(.................. StructField('version', LongType(), True)])

        normalized_data = pd.json_normalize(json_data,["data"],"page_info"]).explode("values").reset_index(drop=True)
        
        JSONdf = json.dumps(json.loads(normalized_data.to_json(orient="records")))
        results.append(JSONdf)

This provides data in the format as below with id value 123456 repeating twice for values

Current output with id 123456 repeating in 2 rows for values.value1 and values.value2

desired output is to have a single row with id 123456 for values column having the entire list

Any help greatly appreciated in getting the desired output. I tried all possible ways but could not get the desired output.

Answers

You can achieve this by adjusting the column names inside the pd.json_normalize function according to the actual structure of your JSON data.

import pandas as pd

# Sample corrected JSON data
json_data = {
    "data": [
        {
            "id": "123456",
            "created_by": "[email protected]",
            "created_at": "2023-12-11",
            "modified_by": "[email protected]",
            "modified_at": "2023-12-11",
            "rg": {
                "id": "xyz234",
                "parent_id": "cpqrst",
                "account_id": "ntb1we",
                "values": [
                    {
                        "value1": 100,
                        "value2": 101,
                        "value10": {
                            "valueid": "btqns"
                        }
                    },
                    {
                        "value1": 200,
                        "value2": 201,
                        "value10": {
                            "valueid": "apqns"
                        }
                    }
                ]
            },
            "page_info": {
                "page": 1,
                "per_page": 100,
                "total_pages": 286,
                "total_count": 28580
            }
        }
    ]
}

# Flatten the JSON and explode the "values" column
df = pd.json_normalize(json_data['data'], 'rg.values', ['id', 'created_by', 'created_at', 'modified_by', 'modified_at', 'rg.id', 'rg.parent_id', 'rg.account_id'])

print(df)

IIUC use:

out = pd.json_normalize(json_data["data"])

print(out)
       id     created_by  created_at    modified_by modified_at   rg.id  
0  123456  [email protected]  2023-12-11  [email protected]  2023-12-11  xyz234   

  rg.parent_id rg.account_id  
0       cpqrst        ntb1we   

                                           rg.values  page_info.page  
0  [{'value1': 100, 'value2': 101, 'value10': {'v...               1   

   page_info.per_page  page_info.total_pages  page_info.total_count  
0                 100                    286                  28580

If need separator _:

out = pd.json_normalize(json_data["data"], sep='_')

print(out)
       id     created_by  created_at    modified_by modified_at   rg_id  
0  123456  [email protected]  2023-12-11  [email protected]  2023-12-11  xyz234   

  rg_parent_id rg_account_id  
0       cpqrst        ntb1we   

                                           rg_values  page_info_page  
0  [{'value1': 100, 'value2': 101, 'value10': {'v...               1   

   page_info_per_page  page_info_total_pages  page_info_total_count  
0                 100                    286                  28580

Please signup or login to give your own answer.

Click here to cancel reply.