skip to Main Content

I am trying to split the data from one of the column in json / dict format to new rows and column.
e.g. Input

Column A Column B Column C
john blue [{city: "Manhattan", job_type: "PERM", sal_gp: 0d-1, age: 3.4}, {city: "Dallas", job_type: "TEMP", sal_gp: 0d-1, age: 1.4}]
jack gold [{city: "San Diego", job_type: "TEMP", sal_gp: 0d-1, age: 5.44}]

expected output

Column A Column B city job_type sal_gp age
john blue Manhattan PERM 0d-1 3.4
john blue Dallas TEMP 0d-1 1.4
Jack gold San Diego TEMP 0d-1 5.44

output of df.to_dict(‘tight’):

{'index': [0,1], 'columns': ['Column A', 'Column B', 'Column C'], 'data': [['john','blue',[{city: "Manhattan", job_type: "PERM", sal_gp: 0d-1, age: 3.4}, {city: "Dallas", job_type: "TEMP", sal_gp: 0d-1, age:
1.4}]],'jack', 'gold',  [{city: "San Diego", job_type: "TEMP", sal_gp: 0d-1, age: 5.44}]], 'index_names':[None], 'column_names': [None] }

I already tried json loads and pandas explode but it is not working.
I am not sure but is there any alternative solution for this?

2

Answers


  1. Here is the minimal reproducible example (that you should have provided)

    df=pd.DataFrame({'A':['john', 'jack'], 'B':['blue','gold'], 'C':[[{'city':'Manhattan', 'job':'perm'}, {'city':'Dallas', 'job':'temp'}], [{'city':'San Diego', 'job':'temp'}]]})
    

    I’ve simplified a bit the names and the fields, since this is what "minimal" examples are. And it is reproducible, because anybody can copy and paste the code and try it at home.

    A B C
    0 john blue [{‘city’: ‘Manhattan’, ‘job’: ‘perm’}, {‘city’: ‘Dallas’, ‘job’: ‘temp’}]
    1 jack gold [{‘city’: ‘San Diego’, ‘job’: ‘temp’}]

    The thing is, column C needs two layer of "expanding". One for the array, then another for the dicts in that array.

    Let’s start by exploding arrays

    dfc=df.explode('C').reset_index(drop=True)
    

    Gives

    A B C
    0 john blue {‘city’: ‘Manhattan’, ‘job’: ‘perm’}
    1 john blue {‘city’: ‘Dallas’, ‘job’: ‘temp’}
    2 jack gold {‘city’: ‘San Diego’, ‘job’: ‘temp’}

    So, half of the job is done: one row for each value in C array. Rest the expanding of fields of dict.

    We can find how to do it here

    dfc.join(pd.json_normalize(dfc.C)).drop(columns='C')
    

    json_normalize creates columns with fields of C. join add those new columns. And drop drops the C column now that we no longer need it

    Result is:

    A B city job
    0 john blue Manhattan perm
    1 john blue Dallas temp
    2 jack gold San Diego temp
    Login or Signup to reply.
  2. Code

    explode & apply

    df1 = df.explode('Column C', ignore_index=True)
    df1[['Column A', 'Column B']].join(df1['Column C'].apply(pd.Series))
    

    output:

        Column A    Column B    city        job_type    sal_gp  age
    0   john        blue        Manhattan   PERM        0d-1    3.40
    1   john        blue        Dallas      TEMP        0d-1    1.40
    2   jack        gold        San Diego   TEMP        0d-1    5.44
    

    If the results are not the same, please provide your example in code.

    Example Code

    import pandas as pd
    data1 = {'Column A': ['john', 'jack'], 'Column B': ['blue', 'gold'], 
             'Column C': [[{'city': "Manhattan", 'job_type': "PERM", 'sal_gp': '0d-1', 'age': 3.4}, 
                           {'city': "Dallas", 'job_type': "TEMP", 'sal_gp': '0d-1', 'age': 1.4}], 
                          [{'city': "San Diego", 'job_type': "TEMP", 'sal_gp': '0d-1', 'age': 5.44}]]}
    df = pd.DataFrame(data1)
    

    Intermediate

    df1

    Column A    Column B    Column C
    0   john    blue    {'city': 'Manhattan', 'job_type': 'PERM', 'sal...
    1   john    blue    {'city': 'Dallas', 'job_type': 'TEMP', 'sal_gp...
    2   jack    gold    {'city': 'San Diego', 'job_type': 'TEMP', 'sal...
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search