skip to Main Content

I have a data frame which has 2 columns Id and Loan_Info. Loan_Info column stores data into JSON format which I need to convert it into data frame where every key would be new column and corresponding key value would represent the data for that column. My data frame looks like this.

Id Loan_Info
1 {‘111222-aaabbb’: [{‘loan_amt’: ‘100000’, ‘credit_amt’: ‘291000’, ‘loan_date’: ’01-11-2019′, ‘Bank_Name’: ‘HDFC’}, {‘loan_amt’: ‘200000’, ‘credit_amt’: ‘5850000’, ‘loan_date’: ’02-11-2019′, ‘Bank_Name’: ‘ICICI’}], ‘111333-cccddd’: [{‘loan_amt’: ‘14000’, ‘credit_amt’: ‘3080000’, ‘loan_date’: ’26-04-2023′, ‘Bank_Name’: ‘HDFC’}]}
2 {‘2226473-andis’: [{‘loan_amt’: ‘14000’, ‘credit_amt’: ‘3080000’, ‘loan_date’: ’26-04-2023′, ‘Bank_Name’: ‘IDFC’}]}

So my final output data should appear like this as shown below:

Id id_test loan_amt credit_amt loan_date Bank_Name
1 111222-aaabbb 100000 291000 01-11-2019 HDFC
1 111222-aaabbb 200000 5850000 02-11-2019 ICICI
1 111333-cccddd 14000 3080000 26-04-2023 HDFC
2 2226473-andis 14000 3080000 26-04-2023 IDFC

If JSON format column is in presented in dictionary type then I am able to create above format data however this only works with Dictionary type data and not when data frame is given which is given below

data = {'111222-aaabbb': [
    {'loan_amt': '100000', 'credit_amt': '291000', 'loan_date': '01-11-2019', 'Bank_Name': 'HDFC'}, 
    {'loan_amt': '200000', 'credit_amt': '5850000', 'loan_date': '02-11-2019', 'Bank_Name': 'ICICI'}], 
    '111333-cccddd': [{'loan_amt': '14000', 'credit_amt': '3080000', 'loan_date': '26-04-2023', 'Bank_Name': 'HDFC'}], 
'2226473-andis': [{'loan_amt': '14000', 'credit_amt': '3080000', 'loan_date': '26-04-2023', 'Bank_Name': 'IDFC'}]}

# Convert JSON to DataFrame
dfs = []
for key in data.keys():
    df = pd.json_normalize(data[key])
    df['entity'] = key
    dfs.append(df)

# Concatenate DataFrames
result = pd.concat(dfs, ignore_index=True)
result.head()

However the above code does not work in a data frame. I have 10000+ observation data frame and need to achieve above result. Kindly help me with this.

Any help is appreciated.

3

Answers


  1. If I understood correctly you are trying to transform a DataFrame with JSON data in the "Loan_Info" column into a new DataFrame where each key-value pair in the JSON is extracted into separate columns. To achieve this, you can follow a similar approach as the dictionary-based one, but you need to apply this transformation to each row of the original DataFrame.

    You can do this using the pandas library:

    import pandas as pd
    
    # Sample data
    data = {'Id': [1, 2],
            'Loan_Info': [{'111222-aaabbb': [{'loan_amt': '100000', 'credit_amt': '291000', 'loan_date': '01-11-2019', 'Bank_Name': 'HDFC'}, {'loan_amt': '200000', 'credit_amt': '5850000', 'loan_date': '02-11-2019', 'Bank_Name': 'ICICI'}],
                           '111333-cccddd': [{'loan_amt': '14000', 'credit_amt': '3080000', 'loan_date': '26-04-2023', 'Bank_Name': 'HDFC'}]},
                          '2226473-andis': [{'loan_amt': '14000', 'credit_amt': '3080000', 'loan_date': '26-04-2023', 'Bank_Name': 'IDFC'}]}]}
    
    # Create the initial DataFrame
    df = pd.DataFrame(data)
    
    # Convert Loan_Info strings to dictionaries
    df['Loan_Info'] = df['Loan_Info'].apply(json.loads)
    
    # Transform Loan_Info column
    dfs = []
    for index, row in df.iterrows():
        id_value = row['Id']
        loan_info = row['Loan_Info']
        
        for key, values in loan_info.items():
            for value in values:
                value['id_test'] = key
                value['Id'] = id_value
                dfs.append(value)
    
    # Create the final DataFrame
    result = pd.DataFrame(dfs)
    
    # Reorder columns if needed
    column_order = ['Id', 'id_test', 'loan_amt', 'credit_amt', 'loan_date', 'Bank_Name']
    result = result[column_order]
    
    print(result)
    

    Replace the sample data with your actual DataFrame, and this code should extract the necessary information from the "Loan_Info" column and create the desired DataFrame with separate columns for each key-value pair within the JSON structure.

    Login or Signup to reply.
  2. Use list with dict comprehension for list of dictionaries with add id and id_test keys and last pass to DataFrame constructor:

    L = [{**{'Id':i, 'id_test':k}, **y} 
             for i, x in zip(df['Id'], df['Loan_Info']) 
             for k, v in x.items() 
             for y in v]
    

    EDIT: You can convert strings values to dictionaries by ast.literal_eval:

    import ast
    
    L = [{**{'Id':i, 'id_test':k}, **y} 
         for i, x in zip(df['Id'], df['Loan_Info']) 
         for k, v in ast.literal_eval(x).items() 
         for y in v]
    

    df = pd.DataFrame(L)
    print (df)
       Id        id_test loan_amt credit_amt   loan_date Bank_Name
    0   1  111222-aaabbb   100000     291000  01-11-2019      HDFC
    1   1  111222-aaabbb   200000    5850000  02-11-2019     ICICI
    2   1  111333-cccddd    14000    3080000  26-04-2023      HDFC
    3   2  2226473-andis    14000    3080000  26-04-2023      IDFC
    

    Performance in repeated sample data with concat and json_normalize is slow:

    #2k rows
    df = pd.concat([df] * 1000, ignore_index=True)
    
    In [91]: %timeit pd.concat([pd.json_normalize(v).assign(Id=i, id_test=k) for i, d in zip(df['Id'], df['Loan_Info'].apply(ast.literal_eval)) for k,v in d.items()])
    2.39 s ± 74.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [92]: %timeit pd.DataFrame([{**{'Id':i, 'id_test':k}, **y}  for i, x in zip(df['Id'], df['Loan_Info'])  for k, v in ast.literal_eval(x).items()   for y in v])
    130 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    EDIT1: If there is multiple columns in DataFrame use:

    L = [{**{'i':i, 'id_test':k}, **y} 
         for i, x in zip(df.index, df.pop('Loan_Info')) 
         for k, v in ast.literal_eval(x).items() 
         for y in v]
    
    df = df.join(pd.DataFrame(L).set_index('i'))
    print (df)
    
       Id        id_test loan_amt credit_amt   loan_date Bank_Name
    0   1  111222-aaabbb   100000     291000  01-11-2019      HDFC
    0   1  111222-aaabbb   200000    5850000  02-11-2019     ICICI
    0   1  111333-cccddd    14000    3080000  26-04-2023      HDFC
    1   2  2226473-andis    14000    3080000  26-04-2023      IDFC
    
    Login or Signup to reply.
  3. You can use a variation of your original approach:

    out = pd.concat([pd.json_normalize(v).assign(Id=i, id_test=k)
     for i, d in zip(df['Id'], df['Loan_Info'])
     for k,v in d.items()
     ])
    

    If you have strings:

    import ast
    
    out = pd.concat([pd.json_normalize(v).assign(Id=i, id_test=k)
     for i, d in zip(df['Id'], df['Loan_Info'].apply(ast.literal_eval))
     for k,v in d.items()
     ])
    

    Output:

      loan_amt credit_amt   loan_date Bank_Name  Id        id_test
    0   100000     291000  01-11-2019      HDFC   1  111222-aaabbb
    1   200000    5850000  02-11-2019     ICICI   1  111222-aaabbb
    0    14000    3080000  26-04-2023      HDFC   1  111333-cccddd
    0    14000    3080000  26-04-2023      IDFC   2  2226473-andis
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search