skip to Main Content

I tried different ways to get this done, but unable to do so. Please help here.
I am trying to explode a nested json list using python pd.json_normalize. After exploding the output is coming in two rows causing duplicate PKeys. My data is below and i am trying to get a single record with "id" as pkey and values in the nested list

{
"data": [
    {
        "id": "123456",
        "created_by": "[email protected]",
        "created_at": "2023-12-11",
        "modified_by": "[email protected]",
        "modified_at": "2023-12-11",
        "rg": {
            "id": "xyz234",
            "parent_id": "cpqrst",
            "account_id": "ntb1we",
 "values": [
            {
                "value1": 100,
                "value2": 101,
                "value10": {
                    "valueid": "btqns",
            },
            {
                "value1": 200,
                "value2": 201,
                "value10": {
                    "valueid": "apqns",
                
            }       
    }
],
"page_info": {
    "page": 1,
    "per_page": 100,
    "total_pages": 286,
    "total_count": 28580
}

}

my code as below to normalize and explode

response = requests.get(getURL, params={"page": page}, headers=headers)
    if response.status_code == 200:
        json_data = response.json()
        target_schema  = StructType([StructField(.................. StructField('version', LongType(), True)])

        normalized_data = pd.json_normalize(json_data,["data"],"page_info"]).explode("values").reset_index(drop=True)
        
        JSONdf = json.dumps(json.loads(normalized_data.to_json(orient="records")))
        results.append(JSONdf)

This provides data in the format as below with id value 123456 repeating twice for values

Current output with id 123456 repeating in 2 rows for values.value1 and values.value2

current_output

desired output is to have a single row with id 123456 for values column having the entire list

desired_output

Any help greatly appreciated in getting the desired output. I tried all possible ways but could not get the desired output.

2

Answers


  1. You can achieve this by adjusting the column names inside the pd.json_normalize function according to the actual structure of your JSON data.

    import pandas as pd
    
    # Sample corrected JSON data
    json_data = {
        "data": [
            {
                "id": "123456",
                "created_by": "[email protected]",
                "created_at": "2023-12-11",
                "modified_by": "[email protected]",
                "modified_at": "2023-12-11",
                "rg": {
                    "id": "xyz234",
                    "parent_id": "cpqrst",
                    "account_id": "ntb1we",
                    "values": [
                        {
                            "value1": 100,
                            "value2": 101,
                            "value10": {
                                "valueid": "btqns"
                            }
                        },
                        {
                            "value1": 200,
                            "value2": 201,
                            "value10": {
                                "valueid": "apqns"
                            }
                        }
                    ]
                },
                "page_info": {
                    "page": 1,
                    "per_page": 100,
                    "total_pages": 286,
                    "total_count": 28580
                }
            }
        ]
    }
    
    # Flatten the JSON and explode the "values" column
    df = pd.json_normalize(json_data['data'], 'rg.values', ['id', 'created_by', 'created_at', 'modified_by', 'modified_at', 'rg.id', 'rg.parent_id', 'rg.account_id'])
    
    print(df)
    
    Login or Signup to reply.
  2. IIUC use:

    out = pd.json_normalize(json_data["data"])
    
    print(out)
           id     created_by  created_at    modified_by modified_at   rg.id  
    0  123456  [email protected]  2023-12-11  [email protected]  2023-12-11  xyz234   
    
      rg.parent_id rg.account_id  
    0       cpqrst        ntb1we   
    
                                               rg.values  page_info.page  
    0  [{'value1': 100, 'value2': 101, 'value10': {'v...               1   
    
       page_info.per_page  page_info.total_pages  page_info.total_count  
    0                 100                    286                  28580  
    

    If need separator _:

    out = pd.json_normalize(json_data["data"], sep='_')
    
    print(out)
           id     created_by  created_at    modified_by modified_at   rg_id  
    0  123456  [email protected]  2023-12-11  [email protected]  2023-12-11  xyz234   
    
      rg_parent_id rg_account_id  
    0       cpqrst        ntb1we   
    
                                               rg_values  page_info_page  
    0  [{'value1': 100, 'value2': 101, 'value10': {'v...               1   
    
       page_info_per_page  page_info_total_pages  page_info_total_count  
    0                 100                    286                  28580  
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search