
I have a DataFrame like the one below:

Data types:

  • COL1 – float
  • COL2 – int
  • COL3 – int
  • COL4 – float
  • COL5 – float
  • COL6 – object
  • COL7 – object

Source code:

import numpy as np
import pandas as pd

a = pd.DataFrame()
a["COL1"] = [0.0, 800.0]
a["COL2"] = [2, 3]
a["COL3"] = [123, 444]
a["COL4"] = [1500.0, 1600.0]
a["COL5"] = [700.0, 850.0]
a["COL6"] = ['{"account": {"sector": 2, "other": 15}}', np.nan]
a["COL7"] = ['{"value": "ab"}', np.nan]


  • COL6 and COL7 contain JSON strings; the JSON in COL6 is nested.
  • Both COL6 and COL7 can contain missing values.
  • I need to expand the values in COL6 and COL7 into regular columns, but I cannot work out how to turn COL6 (nested JSON) into ordinary DataFrame columns.

Desired output:

For COL7 the output should look like the table below, but I am not sure what the output for COL6 should look like.

COL1  | COL2 | COL3 | COL4   | COL5  | value |
------|------|------|--------|-------|-------|
0.0   | 2    | 123  | 1500.0 | 700.0 | ab    |
800.0 | 3    | 444  | 1600.0 | 850.0 | NaN   |

How can I do that in pandas?

The following attempt does not work: pd.json_normalize(df['COL7'].apply(ast.literal_eval)). It raises ValueError: malformed node or string: nan

Sample data (be aware that when I read it into pandas there are also NaN values):

{'COL1': [0.0, 0.0, 0.0],
 'COL2': [2, 0, 33],
 'COL3': [2162561990, 2167912785, 599119703],
 'COL4': [1500.0, 500.0, 3500.0],
 'COL5': [750.0, 0.0, 3500.0],
 'COL6': ['{"account": {"sector": 4, "other": 10}, "account_2": {"sector": 0, "other": 0}, "account_3": {"sector": 6, "other": 8}}'],
 'COL7': ['{"value": "cc", "value_2": 15.58, "value_3": 646}']}

2 Answers


  1. You can try something like the code below, which first flattens the nested JSON.

    The error you were getting is caused by the NaN values, so to avoid it I added an if/else condition.

    Code:

    import pandas as pd
    import ast
    import json 
    
    for col in ['COL6', 'COL7']:
        a[col] = a[col].apply(lambda x: '' if pd.isnull(x) else list(pd.json_normalize(ast.literal_eval(x)).T.to_dict().values())[0])
    a
    

    #output

       COL1  COL2   COL3    COL4    COL5    COL6                           COL7
    0   0.0     2   123 1500.0     700.0    {'account.sector': 2, 'account.other': 15}  ab
    1   800.0   3   444 1600.0     850.0
    
        
    

    After flattening, split that column into separate columns and concatenate them with the original data.

    a = pd.concat([a, a['COL6'].apply(pd.Series).drop(0, axis=1)], axis=1)
    a.columns = a.columns.str.split('.').str[-1]
    

    Output: you will get all the columns; drop the ones you do not need.

                    sector          other
    0                 2.0            15.0
    1                 NaN             NaN   
    
  2. Just for the fun of it, here is another possible solution: restructure the data into dictionaries in a different format:

    import pandas as pd
    import json
    
    data = '''{
        "COL1": [0.0, 0.0, 0.0],
        "COL2": [2, 0, 33],
        "COL3": [2162561990, 2167912785, 599119703],
        "COL4": [1500.0, 500.0, 3500.0],
        "COL5": [750.0, 0.0, 3500.0],
        "COL6": [
            {
                "account": {"sector": 4, "other": 10}, 
                "account_2": {"sector": 0, "other": 0},
                "account_3": {"sector": 6, "other": 8}
            }
        ],
        "COL7": [
            {
                "value": "cc",
                "value_2": 15.58, 
                "value_3": 646}
            ]
        }
    
    '''
    
    d = json.loads(data)
    
    d1, d2, d3 = {}, {}, {}
    cols = []
    
    for k in list(d.keys()):
        if not isinstance(d[k][0], dict):
            d1[k] = d[k][0]
            d2[k] = d[k][1]
            d3[k] = d[k][2]
    
        else:
            cols = list(d[k][0].keys())
            d1[cols[0]] = d[k][0][cols[0]]
            d2[cols[1]] = d[k][0][cols[1]]
            d3[cols[2]] = d[k][0][cols[2]]
    
    df = pd.concat([pd.json_normalize(d1), pd.json_normalize(d2), pd.json_normalize(d3)], ignore_index=True)
    

    yields:

           COL1  COL2        COL3    COL4    COL5 value  account.sector  account.other  value_2  account_2.sector  account_2.other  value_3  account_3.sector  account_3.other
        0   0.0     2  2162561990  1500.0   750.0    cc             4.0           10.0      NaN               NaN              NaN      NaN               NaN              NaN
        1   0.0     0  2167912785   500.0     0.0   NaN             NaN            NaN    15.58               0.0              0.0      NaN               NaN              NaN
        2   0.0    33   599119703  3500.0  3500.0   NaN             NaN            NaN      NaN               NaN              NaN    646.0               6.0              8.0
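
As a side note on both answers: the core problem is parsing the JSON strings while tolerating NaN. Below is a minimal sketch of that idea, assuming every non-missing cell in COL6 and COL7 is a valid JSON string. It uses json.loads instead of ast.literal_eval (which rejects JSON literals such as true, false and null) and treats a missing cell as an empty record. The helper names parse_json_or_empty and expand_json_column are made up for this illustration.

```python
import json

import numpy as np
import pandas as pd

# Sample frame from the question.
a = pd.DataFrame({
    "COL1": [0.0, 800.0],
    "COL2": [2, 3],
    "COL3": [123, 444],
    "COL4": [1500.0, 1600.0],
    "COL5": [700.0, 850.0],
    "COL6": ['{"account": {"sector": 2, "other": 15}}', np.nan],
    "COL7": ['{"value": "ab"}', np.nan],
})


def parse_json_or_empty(cell):
    """Parse a JSON string; return an empty dict for a missing value."""
    return {} if pd.isna(cell) else json.loads(cell)


def expand_json_column(df, col):
    """Flatten one JSON-string column into a frame aligned with df's index."""
    records = df[col].map(parse_json_or_empty).tolist()
    flat = pd.json_normalize(records)  # nested keys become "outer.inner"
    flat.index = df.index              # keep row alignment for concat
    return flat


json_cols = ["COL6", "COL7"]
flat_parts = [expand_json_column(a, col) for col in json_cols]

# Original non-JSON columns followed by the flattened JSON columns.
result = pd.concat([a.drop(columns=json_cols)] + flat_parts, axis=1)
print(result)
```

The dotted names such as account.sector come from the default separator of pd.json_normalize; its sep argument changes that. Rows whose JSON cell was missing end up with NaN in the new columns.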
    