skip to Main Content

I have a CSV with thousands of hundreds of thousands of rows but basically looks like this

personal_id location_type location_number
1 ‘company’ 123
2 ‘branch 321
1 ‘branch 456
1 ‘branch 567

The goal is to group everything by personal_id and beneath that have 2 lists of the location_number that are identified by the location_type

[
    {
        "personal_id": 1,
        "company": [123],
        "branch": [456, 567]
    },
    {
        "personal_id": 2,
        "branch": [321]
    }
]

I used python pandas because i’ve done something successful before but only at 1 filtering level and using pandas to_dict('records) worked perfectly at the time

ive been trying to do something in that light such as this

merge_df= (data_df.groupby(['personal_id'])
    .apply(lambda x: x[['regulator', 'employee_number', 'sex', 'status']]
        .to_dict('records'))
    .reset_index()
    .rename(columns={0: 'employee'}))

but im not able to figure out how to add an additional filter inside the apply() as well as this method creates a column which I dont need in the above scenario that I renamed to ’employee’

My only other option is to start everything over in C# with say CSVHelper and maybe automapper if pandas was the wrong choice

2

Answers


  1. Try:

    df = df.pivot_table(
        index="personal_id", columns="location_type", values="location_number", aggfunc=list
    )
    
    out = [out.append(row[row.notna()].to_dict()) for _, row in df.reset_index().iterrows()]
    print(out)
    

    Prints:

    [
        {"personal_id": 1, "branch": [456, 567], "company": [123]},
        {"personal_id": 2, "branch": [321]},
    ]
    
    Login or Signup to reply.
  2. You can do this:

    # groupby personal_id and then in apply groupby and aggregate by list.
    s = df.groupby("personal_id").apply(
        lambda x: x.groupby("location_type")["location_number"].agg(list).to_dict()
    )
    # then construct dict from series
    out = [{**{"personal_dict": idx}, **v} for idx, v in zip(s.index, s)]
    
    print(out)
    
    [
        {"personal_dict": "1", "branch": [456, 567], "company": [123]},
        {"personal_dict": "2", "branch": [321]},
    ]
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search