skip to Main Content

I have a json string column in a dataframe that looks like this.

{"columns":["ApplicationNum","eads59Us01S","HouseDeal_flag","Liability_Asset_Ratio","CBRAvailPcnt","CMSFairIsaacScore","OweTaxes_or_IRAWithdrawalHistry","eads14Fi02S","GuarantorCount","CBRRevMon","CBRInstalMon","CMSApprovedToRequested","SecIncSource","eads59Us01S_4","Liability_Asset_Ratio_40_90","CBRAvailPcnt_20_95","CMSFairIsaacScore_Fund","eads14Fi02S_2","InstalMonthlyPayments_400_3k","RevolvingMonthlyPayments_1k_cap","ApprovedToRequested_0_100","NoSecIncome","coef_eads59Us01S_4","coef_HouseDeal_flag","coef_Liability_Asset_Ratio_40_90","coef_CBRAvailPcnt_20_95","coef_CMSFairIsaacScore_Fund","coef_OweTaxes_or_IRAWithdrawalHistry","coef_eads14Fi02S_2","coef_GuarantorCount","coef_RevolvingMonthlyPayments_1k_cap","coef_InstalMonthlyPayments_400_3k","coef_ApprovedToRequested_0_100","coef_NoSecIncome","coef_Intercept"],"data":[[569325.0,2,0.0,1,92,825,0.0,4,1.0,74,854,0.51,2,2.0,0.9,92.0,825.0,4.0,854.0,1000.0,0.51,0.0,0.11716245,0.299528064,0.392119645,-0.010826643,-0.004957868,0.339407077,0.061509795,0.3685047,0.000167603,0.000225742,0.902205454,-0.371734864,2.788087559]]}

I have a columns tag in there with a list of column values, and a data tag in there with the corresponding list of values for the column. How can I split this string column in my dataframe into multiple columns with each corresponding value?

3

Answers


  1. I can use rdd to get the columns, data and create dataframe with this.

    rdd = sc.textFile('test.txt')
    
    import json
    cols = rdd.map(lambda x: json.loads(x)['columns']).take(1)[0]
    data = rdd.map(lambda x: json.loads(x)['data']).take(1)[0]
    
    df = spark.createDataFrame(data, cols)
    df.show(truncate=False)
    
    +--------------+-----------+--------------+---------------------+------------+-----------------+-------------------------------+-----------+--------------+---------+------------+----------------------+------------+-------------+---------------------------+------------------+----------------------+-------------+----------------------------+-------------------------------+-------------------------+-----------+------------------+-------------------+--------------------------------+-----------------------+---------------------------+------------------------------------+------------------+-------------------+------------------------------------+---------------------------------+------------------------------+----------------+--------------+
    |ApplicationNum|eads59Us01S|HouseDeal_flag|Liability_Asset_Ratio|CBRAvailPcnt|CMSFairIsaacScore|OweTaxes_or_IRAWithdrawalHistry|eads14Fi02S|GuarantorCount|CBRRevMon|CBRInstalMon|CMSApprovedToRequested|SecIncSource|eads59Us01S_4|Liability_Asset_Ratio_40_90|CBRAvailPcnt_20_95|CMSFairIsaacScore_Fund|eads14Fi02S_2|InstalMonthlyPayments_400_3k|RevolvingMonthlyPayments_1k_cap|ApprovedToRequested_0_100|NoSecIncome|coef_eads59Us01S_4|coef_HouseDeal_flag|coef_Liability_Asset_Ratio_40_90|coef_CBRAvailPcnt_20_95|coef_CMSFairIsaacScore_Fund|coef_OweTaxes_or_IRAWithdrawalHistry|coef_eads14Fi02S_2|coef_GuarantorCount|coef_RevolvingMonthlyPayments_1k_cap|coef_InstalMonthlyPayments_400_3k|coef_ApprovedToRequested_0_100|coef_NoSecIncome|coef_Intercept|
    +--------------+-----------+--------------+---------------------+------------+-----------------+-------------------------------+-----------+--------------+---------+------------+----------------------+------------+-------------+---------------------------+------------------+----------------------+-------------+----------------------------+-------------------------------+-------------------------+-----------+------------------+-------------------+--------------------------------+-----------------------+---------------------------+------------------------------------+------------------+-------------------+------------------------------------+---------------------------------+------------------------------+----------------+--------------+
    |569325.0      |2          |0.0           |1                    |92          |825              |0.0                            |4          |1.0           |74       |854         |0.51                  |2           |2.0          |0.9                        |92.0              |825.0                 |4.0          |854.0                       |1000.0                         |0.51                     |0.0        |0.11716245        |0.299528064        |0.392119645                     |-0.010826643           |-0.004957868               |0.339407077                         |0.061509795       |0.3685047          |1.67603E-4                          |2.25742E-4                       |0.902205454                   |-0.371734864    |2.788087559   |
    +--------------+-----------+--------------+---------------------+------------+-----------------+-------------------------------+-----------+--------------+---------+------------+----------------------+------------+-------------+---------------------------+------------------+----------------------+-------------+----------------------------+-------------------------------+-------------------------+-----------+------------------+-------------------+--------------------------------+-----------------------+---------------------------+------------------------------------+------------------+-------------------+------------------------------------+---------------------------------+------------------------------+----------------+--------------+
    
    Login or Signup to reply.
  2. You can use the function json.loads, transform the json string into a dictionary with column-data-pairs and create new columns from this dictionary with .apply(pd.Series)

    import json
    import pandas as pd
    
    df = pd.DataFrame([["""{"columns":["ApplicationNum","eads59Us01S","HouseDeal_flag","Liability_Asset_Ratio","CBRAvailPcnt","CMSFairIsaacScore","OweTaxes_or_IRAWithdrawalHistry","eads14Fi02S","GuarantorCount","CBRRevMon","CBRInstalMon","CMSApprovedToRequested","SecIncSource","eads59Us01S_4","Liability_Asset_Ratio_40_90","CBRAvailPcnt_20_95","CMSFairIsaacScore_Fund","eads14Fi02S_2","InstalMonthlyPayments_400_3k","RevolvingMonthlyPayments_1k_cap","ApprovedToRequested_0_100","NoSecIncome","coef_eads59Us01S_4","coef_HouseDeal_flag","coef_Liability_Asset_Ratio_40_90","coef_CBRAvailPcnt_20_95","coef_CMSFairIsaacScore_Fund","coef_OweTaxes_or_IRAWithdrawalHistry","coef_eads14Fi02S_2","coef_GuarantorCount","coef_RevolvingMonthlyPayments_1k_cap","coef_InstalMonthlyPayments_400_3k","coef_ApprovedToRequested_0_100","coef_NoSecIncome","coef_Intercept"],"data":[[569325.0,2,0.0,1,92,825,0.0,4,1.0,74,854,0.51,2,2.0,0.9,92.0,825.0,4.0,854.0,1000.0,0.51,0.0,0.11716245,0.299528064,0.392119645,-0.010826643,-0.004957868,0.339407077,0.061509795,0.3685047,0.000167603,0.000225742,0.902205454,-0.371734864,2.788087559]]}"""]], columns=['json_string'])
    df['json_loads'] = df['json_string'].apply(json.loads)
    df['column_names'] = df['json_loads'].apply(lambda x: x['columns'])
    df['data'] = df['json_loads'].apply(lambda x: x['data'][0])
    
    # turning it into a dictionary
    df['dict_values']=df.apply(lambda x: dict(zip(x['column_names'],x['data'])), axis=1)
    
    df = pd.concat([df, df['dict_values'].apply(pd.Series)], axis=1)
    
    print(df.head())
    
    Login or Signup to reply.
  3. You provide data with a non-regular JSON string format in the form {columns –> [], data –> [ [ ] ] } (columns and data do not have the same length).

    So let’s assume data is a mere list data = [...] (like it would be in a json split orientation).

    Proposed script

    import pandas as pd
    
    
    data = {"columns":["ApplicationNum","eads59Us01S","HouseDeal_flag","Liability_Asset_Ratio","CBRAvailPcnt","CMSFairIsaacScore","OweTaxes_or_IRAWithdrawalHistry","eads14Fi02S","GuarantorCount","CBRRevMon","CBRInstalMon","CMSApprovedToRequested","SecIncSource","eads59Us01S_4","Liability_Asset_Ratio_40_90","CBRAvailPcnt_20_95","CMSFairIsaacScore_Fund","eads14Fi02S_2","InstalMonthlyPayments_400_3k","RevolvingMonthlyPayments_1k_cap","ApprovedToRequested_0_100","NoSecIncome","coef_eads59Us01S_4","coef_HouseDeal_flag","coef_Liability_Asset_Ratio_40_90","coef_CBRAvailPcnt_20_95","coef_CMSFairIsaacScore_Fund","coef_OweTaxes_or_IRAWithdrawalHistry","coef_eads14Fi02S_2","coef_GuarantorCount","coef_RevolvingMonthlyPayments_1k_cap","coef_InstalMonthlyPayments_400_3k","coef_ApprovedToRequested_0_100","coef_NoSecIncome","coef_Intercept"],
            "data":[569325.0,2,0.0,1,92,825,0.0,4,1.0,74,854,0.51,2,2.0,0.9,92.0,825.0,4.0,854.0,1000.0,0.51,0.0,0.11716245,0.299528064,0.392119645,-0.010826643,-0.004957868,0.339407077,0.061509795,0.3685047,0.000167603,0.000225742,0.902205454,-0.371734864,2.788087559]}
    
    df = pd.DataFrame(data)
    
    df.pivot_table(index=None, columns='columns', values='data')
    
    columns  ApplicationNum  ApprovedToRequested_0_100  ...  eads59Us01S  eads59Us01S_4
    data           569325.0                       0.51  ...          2.0            2.0
    
    [1 rows x 35 columns]
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search