skip to Main Content

I am relatively new to JSON and after going through several tutorials I have decided to reach out here for additional advice. I am trying to convert the following JSON dictionary via Jupyternotebook/Pandas into a dataframe and later into a csv file (data is extracted from a public API).

Example JSON dictionary without indentation:

json_search_result = 
{'inactive_count': 0, 'links': {'self': '/company/08892216/officers'}, 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': [{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': '/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'}, 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}

With indentation for better visibility:

json_search_result = 
{'inactive_count': 0, 'links': {'self': '/company/08892216/officers'},
 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 
'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': 
[{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': 
'/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 
'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'},
 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 
'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}

I have tried among others json_normalize but I either get errors given this is a dictionary:

df2 =  pd.json_normalize(json_search_result)

Thank you for any advice on this issues!

2

Answers


  1. Please try this if this helps your problem.
    Good reference for further improvements: Convert list of dictionaries to a pandas DataFrame

    json_search_result = {'inactive_count': 0, 'links': {'self': '/company/08892216/officers'}, 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': [{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': '/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'}, 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}
    
    pprint(json_search_result)
    
    out = pd.DataFrame.from_dict(json_search_result, orient='index').T
    print(out)
    print(out.columns)
    
    Login or Signup to reply.
  2. I assume that you will also want to flatten your dataframe since it contains additional nested information. So start by transforming you json string to a dataframe:

    json_search_result = {'inactive_count': 0, 'links': {'self': '/company/08892216/officers'}, 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': [{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': '/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'}, 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}
    A = pd.DataFrame.from_dict(json_search_result, orient='index').T
    

    which gives:

    inactive_count                                   links  
    0              0  {'self': '/company/08892216/officers'}   
    
                                           etag          kind resigned_count  
    0  1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8  officer-list              0   
    
      total_results active_count items_per_page  
    0             1            1             35   
    
                                                   items start_index  
    0  [{'date_of_birth': {'year': 1964, 'month': 2},...           0  
    ​
    

    Now, this function will flatten ANY dataframe with nested json strings:

    
    def flatten_nested_json_df(df):
        df = df.reset_index()
        s = (df.applymap(type) == list).all()
        list_columns = s[s].index.tolist()
    
        s = (df.applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()
    
        while len(list_columns) > 0 or len(dict_columns) > 0:
            new_columns = []
    
            for col in dict_columns:
                exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
                exploded.index = df.index
                df = pd.concat([df, exploded], axis=1).drop(columns=[col])
                new_columns.extend(exploded.columns)  # inplace
    
            for col in list_columns:
                # print(f"exploding: {col}")
                df = df.drop(columns=[col]).join(df[col].explode().to_frame())
                new_columns.append(col)
    
            s = (df[new_columns].applymap(type) == list).all()
            list_columns = s[s].index.tolist()
    
            s = (df[new_columns].applymap(type) == dict).all()
            dict_columns = s[s].index.tolist()
        return df
    

    Applying it:

    flatten_nested_json_df(A)
    

    will result in:

      index inactive_count                                      etag  
    0      0              0  1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8   
    
               kind resigned_count total_results active_count items_per_page  
    0  officer-list              0             1            1             35   
    
      start_index                  links.self  ...  items.occupation  
    0           0  /company/08892216/officers  ...  Company Director   
    
      items.date_of_birth.year items.date_of_birth.month  
    0                     1964                         2   
    
                                        items.links.self  
    0  /company/08892216/appointments/Ao1-pIsFQHy0FVS...   
    
                        items.links.officer.appointments items.address.premises  
    0  /officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointm...                     19   
    
       items.address.locality  items.address.country items.address.address_line_1  
    0                  London         United Kingdom                 Norcott Road   
    
      items.address.postal_code  
    0                   N16 7EJ 
    

    As you can see, the nested fields have been made to columns:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1 entries, 0 to 0
    Data columns (total 25 columns):
     #   Column                            Non-Null Count  Dtype 
    ---  ------                            --------------  ----- 
     0   index                             1 non-null      int64 
     1   inactive_count                    1 non-null      object
     2   etag                              1 non-null      object
     3   kind                              1 non-null      object
     4   resigned_count                    1 non-null      object
     5   total_results                     1 non-null      object
     6   active_count                      1 non-null      object
     7   items_per_page                    1 non-null      object
     8   start_index                       1 non-null      object
     9   links.self                        1 non-null      object
     10  items.officer_role                1 non-null      object
     11  items.appointed_on                1 non-null      object
     12  items.nationality                 1 non-null      object
     13  items.country_of_residence        1 non-null      object
     14  items.name                        1 non-null      object
     15  items.occupation                  1 non-null      object
     16  items.date_of_birth.year          1 non-null      int64 
     17  items.date_of_birth.month         1 non-null      int64 
     18  items.links.self                  1 non-null      object
     19  items.links.officer.appointments  1 non-null      object
     20  items.address.premises            1 non-null      object
     21  items.address.locality            1 non-null      object
     22  items.address.country             1 non-null      object
     23  items.address.address_line_1      1 non-null      object
     24  items.address.postal_code         1 non-null      object
    dtypes: int64(3), object(22)
    memory usage: 328.0+ bytes
    None
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search