Question posted in Json
Our archive of expertly curated questions and answers provides insights and solutions to common problems related to this popular data interchange format. From parsing and manipulating JSON data to integrating it with various programming languages and web services, our archive has got you covered. Start exploring today and take your JSON skills to the next level

How to split Json string column in Pandas Dataframe with multiple lists to multiple columns?

randomguy2443
February 14, 2023
130 views
2 votes
3 Answers

I have a json string column in a dataframe that looks like this.

{"columns":["ApplicationNum","eads59Us01S","HouseDeal_flag","Liability_Asset_Ratio","CBRAvailPcnt","CMSFairIsaacScore","OweTaxes_or_IRAWithdrawalHistry","eads14Fi02S","GuarantorCount","CBRRevMon","CBRInstalMon","CMSApprovedToRequested","SecIncSource","eads59Us01S_4","Liability_Asset_Ratio_40_90","CBRAvailPcnt_20_95","CMSFairIsaacScore_Fund","eads14Fi02S_2","InstalMonthlyPayments_400_3k","RevolvingMonthlyPayments_1k_cap","ApprovedToRequested_0_100","NoSecIncome","coef_eads59Us01S_4","coef_HouseDeal_flag","coef_Liability_Asset_Ratio_40_90","coef_CBRAvailPcnt_20_95","coef_CMSFairIsaacScore_Fund","coef_OweTaxes_or_IRAWithdrawalHistry","coef_eads14Fi02S_2","coef_GuarantorCount","coef_RevolvingMonthlyPayments_1k_cap","coef_InstalMonthlyPayments_400_3k","coef_ApprovedToRequested_0_100","coef_NoSecIncome","coef_Intercept"],"data":[[569325.0,2,0.0,1,92,825,0.0,4,1.0,74,854,0.51,2,2.0,0.9,92.0,825.0,4.0,854.0,1000.0,0.51,0.0,0.11716245,0.299528064,0.392119645,-0.010826643,-0.004957868,0.339407077,0.061509795,0.3685047,0.000167603,0.000225742,0.902205454,-0.371734864,2.788087559]]}

I have a columns tag in there with a list of column values, and a data tag in there with the corresponding list of values for the column. How can I split this string column in my dataframe into multiple columns with each corresponding value?

Answers

I can use rdd to get the columns, data and create dataframe with this.

rdd = sc.textFile('test.txt')

import json
cols = rdd.map(lambda x: json.loads(x)['columns']).take(1)[0]
data = rdd.map(lambda x: json.loads(x)['data']).take(1)[0]

df = spark.createDataFrame(data, cols)
df.show(truncate=False)

+--------------+-----------+--------------+---------------------+------------+-----------------+-------------------------------+-----------+--------------+---------+------------+----------------------+------------+-------------+---------------------------+------------------+----------------------+-------------+----------------------------+-------------------------------+-------------------------+-----------+------------------+-------------------+--------------------------------+-----------------------+---------------------------+------------------------------------+------------------+-------------------+------------------------------------+---------------------------------+------------------------------+----------------+--------------+
|ApplicationNum|eads59Us01S|HouseDeal_flag|Liability_Asset_Ratio|CBRAvailPcnt|CMSFairIsaacScore|OweTaxes_or_IRAWithdrawalHistry|eads14Fi02S|GuarantorCount|CBRRevMon|CBRInstalMon|CMSApprovedToRequested|SecIncSource|eads59Us01S_4|Liability_Asset_Ratio_40_90|CBRAvailPcnt_20_95|CMSFairIsaacScore_Fund|eads14Fi02S_2|InstalMonthlyPayments_400_3k|RevolvingMonthlyPayments_1k_cap|ApprovedToRequested_0_100|NoSecIncome|coef_eads59Us01S_4|coef_HouseDeal_flag|coef_Liability_Asset_Ratio_40_90|coef_CBRAvailPcnt_20_95|coef_CMSFairIsaacScore_Fund|coef_OweTaxes_or_IRAWithdrawalHistry|coef_eads14Fi02S_2|coef_GuarantorCount|coef_RevolvingMonthlyPayments_1k_cap|coef_InstalMonthlyPayments_400_3k|coef_ApprovedToRequested_0_100|coef_NoSecIncome|coef_Intercept|
+--------------+-----------+--------------+---------------------+------------+-----------------+-------------------------------+-----------+--------------+---------+------------+----------------------+------------+-------------+---------------------------+------------------+----------------------+-------------+----------------------------+-------------------------------+-------------------------+-----------+------------------+-------------------+--------------------------------+-----------------------+---------------------------+------------------------------------+------------------+-------------------+------------------------------------+---------------------------------+------------------------------+----------------+--------------+
|569325.0      |2          |0.0           |1                    |92          |825              |0.0                            |4          |1.0           |74       |854         |0.51                  |2           |2.0          |0.9                        |92.0              |825.0                 |4.0          |854.0                       |1000.0                         |0.51                     |0.0        |0.11716245        |0.299528064        |0.392119645                     |-0.010826643           |-0.004957868               |0.339407077                         |0.061509795       |0.3685047          |1.67603E-4                          |2.25742E-4                       |0.902205454                   |-0.371734864    |2.788087559   |
+--------------+-----------+--------------+---------------------+------------+-----------------+-------------------------------+-----------+--------------+---------+------------+----------------------+------------+-------------+---------------------------+------------------+----------------------+-------------+----------------------------+-------------------------------+-------------------------+-----------+------------------+-------------------+--------------------------------+-----------------------+---------------------------+------------------------------------+------------------+-------------------+------------------------------------+---------------------------------+------------------------------+----------------+--------------+

You can use the function json.loads, transform the json string into a dictionary with column-data-pairs and create new columns from this dictionary with .apply(pd.Series)

import json
import pandas as pd

df = pd.DataFrame([["""{"columns":["ApplicationNum","eads59Us01S","HouseDeal_flag","Liability_Asset_Ratio","CBRAvailPcnt","CMSFairIsaacScore","OweTaxes_or_IRAWithdrawalHistry","eads14Fi02S","GuarantorCount","CBRRevMon","CBRInstalMon","CMSApprovedToRequested","SecIncSource","eads59Us01S_4","Liability_Asset_Ratio_40_90","CBRAvailPcnt_20_95","CMSFairIsaacScore_Fund","eads14Fi02S_2","InstalMonthlyPayments_400_3k","RevolvingMonthlyPayments_1k_cap","ApprovedToRequested_0_100","NoSecIncome","coef_eads59Us01S_4","coef_HouseDeal_flag","coef_Liability_Asset_Ratio_40_90","coef_CBRAvailPcnt_20_95","coef_CMSFairIsaacScore_Fund","coef_OweTaxes_or_IRAWithdrawalHistry","coef_eads14Fi02S_2","coef_GuarantorCount","coef_RevolvingMonthlyPayments_1k_cap","coef_InstalMonthlyPayments_400_3k","coef_ApprovedToRequested_0_100","coef_NoSecIncome","coef_Intercept"],"data":[[569325.0,2,0.0,1,92,825,0.0,4,1.0,74,854,0.51,2,2.0,0.9,92.0,825.0,4.0,854.0,1000.0,0.51,0.0,0.11716245,0.299528064,0.392119645,-0.010826643,-0.004957868,0.339407077,0.061509795,0.3685047,0.000167603,0.000225742,0.902205454,-0.371734864,2.788087559]]}"""]], columns=['json_string'])
df['json_loads'] = df['json_string'].apply(json.loads)
df['column_names'] = df['json_loads'].apply(lambda x: x['columns'])
df['data'] = df['json_loads'].apply(lambda x: x['data'][0])

# turning it into a dictionary
df['dict_values']=df.apply(lambda x: dict(zip(x['column_names'],x['data'])), axis=1)

df = pd.concat([df, df['dict_values'].apply(pd.Series)], axis=1)

print(df.head())

You provide data with a non-regular JSON string format in the form {columns –> [], data –> [ [ ] ] } (columns and data do not have the same length).

So let’s assume data is a mere list data = [...] (like it would be in a json split orientation).

Proposed script

import pandas as pd


data = {"columns":["ApplicationNum","eads59Us01S","HouseDeal_flag","Liability_Asset_Ratio","CBRAvailPcnt","CMSFairIsaacScore","OweTaxes_or_IRAWithdrawalHistry","eads14Fi02S","GuarantorCount","CBRRevMon","CBRInstalMon","CMSApprovedToRequested","SecIncSource","eads59Us01S_4","Liability_Asset_Ratio_40_90","CBRAvailPcnt_20_95","CMSFairIsaacScore_Fund","eads14Fi02S_2","InstalMonthlyPayments_400_3k","RevolvingMonthlyPayments_1k_cap","ApprovedToRequested_0_100","NoSecIncome","coef_eads59Us01S_4","coef_HouseDeal_flag","coef_Liability_Asset_Ratio_40_90","coef_CBRAvailPcnt_20_95","coef_CMSFairIsaacScore_Fund","coef_OweTaxes_or_IRAWithdrawalHistry","coef_eads14Fi02S_2","coef_GuarantorCount","coef_RevolvingMonthlyPayments_1k_cap","coef_InstalMonthlyPayments_400_3k","coef_ApprovedToRequested_0_100","coef_NoSecIncome","coef_Intercept"],
        "data":[569325.0,2,0.0,1,92,825,0.0,4,1.0,74,854,0.51,2,2.0,0.9,92.0,825.0,4.0,854.0,1000.0,0.51,0.0,0.11716245,0.299528064,0.392119645,-0.010826643,-0.004957868,0.339407077,0.061509795,0.3685047,0.000167603,0.000225742,0.902205454,-0.371734864,2.788087559]}

df = pd.DataFrame(data)

df.pivot_table(index=None, columns='columns', values='data')

columns  ApplicationNum  ApprovedToRequested_0_100  ...  eads59Us01S  eads59Us01S_4
data           569325.0                       0.51  ...          2.0            2.0

[1 rows x 35 columns]

Please signup or login to give your own answer.

Click here to cancel reply.