
I have a data structure which is a very complex/nested JSON.

After reading other solutions on SO, I am comfortable with the fact that I may need to make multiple DFs and merge them together (e.g json_normalize with multiple record paths).

{
  "topLevelIdentifier": "a86507",
  "activities": [
    {
      "activityId": "1f13d848",
      "taskList": [
        {
          "taskIdentifier": "c6ba-1012",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "3438"
        },
        {
          "taskIdentifier": "cc48-07bf",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "4b32"
        },
        {
          "taskIdentifier": "35b896",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "7bcca"
        }
      ],
      "sectionB": {
        "someCapacityA": {
          "totalVolume": {
            "value": 8415,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 1059.0,
            "unit": "GRAMS"
          }
        },
        "someCapacityB": {
          "totalVolume": {
            "value": 0.0,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 0.0,
            "unit": "GRAMS"
          }
        }
      },
 "sectionC": {....},
"sectionD": {....}, 
"sectonE":{...}
}]}

I can create a base for the merges by doing this:

import json
import pandas as pd

with open('my_file.json') as simulation_file:
    simulation_data = json.load(simulation_file)

df_base = pd.json_normalize(simulation_data, 'activities', ['topLevelIdentifier'])

Then normalizing the next part isn’t too bad:

df_taskList = pd.json_normalize(simulation_data['activities'],
                                'taskList', ['activityId'])

It’s normalizing the next section that has me stuck. Here’s what I’m trying, but it throws an error:

df_sectionB = pd.json_normalize(simulation_data['activities'],
                                'activityId',
                                ['taskList', ['taskIdentifier', ['taskTime', ['unit', 'value']], 'taskLocationIdentifier']])

I’m basically trying to flatten out sectionB such that it has the activityId, so that I can merge all the pieces together.
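
For context, once each piece is flat, the merge I have in mind is roughly this (just a sketch using the two frames above; df_base still carries the raw taskList column, so I drop it before joining):

df_merged = df_taskList.merge(df_base.drop(columns=['taskList']), on='activityId')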

If there’s a simpler way to flatten out to the deepest layer in a complex format like this, I’m open to alternative approaches!

2 Answers


  1. You can try flatten_json. I had to remove sections C, D, and E from your example, so it’s unclear whether this will work exactly as you need it to. Refer to its documentation for more information.

    Set your dictionary equal to test_json:

    import pandas as pd
    from flatten_json import flatten

    # note: test_json needs to be wrapped in a list or you'll get an error
    dic_flattened = (flatten(d, '.') for d in [test_json])
    df = pd.DataFrame(dic_flattened)
    df
    
      topLevelIdentifier activities.0.activityId  ... activities.0.sectionB.someCapacityB.totalWeight.value  activities.0.sectionB.someCapacityB.totalWeight.unit
    0             a86507                1f13d848  ...                                               0.00000                                                 GRAMS
    
    [1 rows x 22 columns]
    
    # or transpose it to see the columns better
    df.T
    
                                                                   0
    topLevelIdentifier                                        a86507
    activities.0.activityId                                 1f13d848
    activities.0.taskList.0.taskIdentifier                 c6ba-1012
    activities.0.taskList.0.taskTime.value                         0
    activities.0.taskList.0.taskTime.unit                    SECONDS
    activities.0.taskList.0.taskLocationIdentifier              3438
    activities.0.taskList.1.taskIdentifier                 cc48-07bf
    activities.0.taskList.1.taskTime.value                         0
    activities.0.taskList.1.taskTime.unit                    SECONDS
    activities.0.taskList.1.taskLocationIdentifier              4b32
    activities.0.taskList.2.taskIdentifier                    35b896
    activities.0.taskList.2.taskTime.value                         0
    activities.0.taskList.2.taskTime.unit                    SECONDS
    activities.0.taskList.2.taskLocationIdentifier             7bcca
    activities.0.sectionB.someCapacityA.totalVolume.value       8415
    activities.0.sectionB.someCapacityA.totalVolume.unit       CU_CM
    activities.0.sectionB.someCapacityA.totalWeight.value 1059.00000
    activities.0.sectionB.someCapacityA.totalWeight.unit       GRAMS
    activities.0.sectionB.someCapacityB.totalVolume.value    0.00000
    activities.0.sectionB.someCapacityB.totalVolume.unit       CU_CM
    activities.0.sectionB.someCapacityB.totalWeight.value    0.00000
    activities.0.sectionB.someCapacityB.totalWeight.unit       GRAMS
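
    If one wide row per file is too coarse, a variation (just a sketch, assuming the same simulation_data dict from the question) is to flatten each activity separately, giving one row per activity:

    # flatten each activity dict on its own, then tag the rows with the top-level id
    rows = (flatten(activity, '.') for activity in simulation_data['activities'])
    df_activities = pd.DataFrame(rows)
    df_activities['topLevelIdentifier'] = simulation_data['topLevelIdentifier']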
    
  2. You could add .explode("taskList") to the result of the first normalize.

    You can then normalize the taskList column, copy the base index, and concat the results.

    df_base = pd.json_normalize(
       simulation_data, 
       record_path="activities", 
       meta="topLevelIdentifier"
    ).explode("taskList")
    
    df = pd.concat(
       [ 
          df_base.drop(columns="taskList"),
          pd.json_normalize(df_base["taskList"]).set_index(df_base.index)
       ],
       axis=1
    )
    
    >>> df.T
                                                       0           0           0
    activityId                                  1f13d848    1f13d848    1f13d848
    sectionC                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
    sectionD                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
    sectionE                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
    sectionB.someCapacityA.totalVolume.value        8415        8415        8415
    sectionB.someCapacityA.totalVolume.unit        CU_CM       CU_CM       CU_CM
    sectionB.someCapacityA.totalWeight.value      1059.0      1059.0      1059.0
    sectionB.someCapacityA.totalWeight.unit        GRAMS       GRAMS       GRAMS
    sectionB.someCapacityB.totalVolume.value         0.0         0.0         0.0
    sectionB.someCapacityB.totalVolume.unit        CU_CM       CU_CM       CU_CM
    sectionB.someCapacityB.totalWeight.value         0.0         0.0         0.0
    sectionB.someCapacityB.totalWeight.unit        GRAMS       GRAMS       GRAMS
    topLevelIdentifier                            a86507      a86507      a86507
    taskIdentifier                             c6ba-1012   cc48-07bf      35b896
    taskLocationIdentifier                          3438        4b32       7bcca
    taskTime.value                                     0           0           0
    taskTime.unit                                SECONDS     SECONDS     SECONDS
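
    The same explode-then-normalize pattern can be repeated for any other list-valued column. For example, if sectionC (elided in the sample, so this is purely hypothetical) turned out to be a list of dicts:

    # only needed if sectionC is actually a list of dicts in the real data
    df2 = df.explode("sectionC")
    df2 = pd.concat(
        [
            df2.drop(columns="sectionC"),
            pd.json_normalize(df2["sectionC"]).set_index(df2.index),
        ],
        axis=1,
    )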
    