
Load CSV into pandas and convert to JSON hierarchy based on column values

Tgr
January 11, 2023
2 Answers

I have a CSV with hundreds of thousands of rows, but it basically looks like this:

personal_id	location_type	location_number
1	company	123
2	branch	321
1	branch	456
1	branch	567

The goal is to group everything by personal_id and, beneath each ID, have up to two lists of location_number values, one per location_type:

[
    {
        "personal_id": 1,
        "company": [123],
        "branch": [456, 567]
    },
    {
        "personal_id": 2,
        "branch": [321]
    }
]

I used Python pandas because I have done something similar successfully before, but only with one level of grouping, where pandas' to_dict('records') worked perfectly.

I've been trying to do something along those lines, such as this:

merge_df= (data_df.groupby(['personal_id'])
    .apply(lambda x: x[['regulator', 'employee_number', 'sex', 'status']]
        .to_dict('records'))
    .reset_index()
    .rename(columns={0: 'employee'}))

but I can't figure out how to add an additional filter inside the apply(). This method also creates a column (the one I renamed to 'employee') that I don't need in the scenario above.

My only other option is to start over in C# with, say, CsvHelper and maybe AutoMapper, if pandas was the wrong choice.

Answers

Try:

df = df.pivot_table(
    index="personal_id", columns="location_type", values="location_number", aggfunc=list
)

# Build one dict per personal_id, skipping location types with no values (NaN)
out = [row[row.notna()].to_dict() for _, row in df.reset_index().iterrows()]
print(out)

Prints:

[
    {"personal_id": 1, "branch": [456, 567], "company": [123]},
    {"personal_id": 2, "branch": [321]},
]
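
To cover the full path from file to JSON text, here is a minimal sketch built on the pivot_table approach above. The inline sample stands in for the real file, and the comma separator is an assumption (pass sep="\t" to read_csv if the data is tab-separated). The names csv_text, wide, records and json_text are illustrative only.

```python
import io
import json

import pandas as pd

# Illustrative sample standing in for the real CSV file (assumed comma-separated).
csv_text = """personal_id,location_type,location_number
1,company,123
2,branch,321
1,branch,456
1,branch,567
"""

df = pd.read_csv(io.StringIO(csv_text))

# One row per personal_id, one column per location_type, each cell a list.
wide = df.pivot_table(
    index="personal_id", columns="location_type", values="location_number", aggfunc=list
)

# Drop the NaN cells so each person only gets the location types they have.
records = [row[row.notna()].to_dict() for _, row in wide.reset_index().iterrows()]

# default=int is a guard in case any NumPy integer scalars remain,
# since the json module cannot serialize those on its own.
json_text = json.dumps(records, indent=4, default=int)
print(json_text)
```

The pivoted frame has one row per personal_id, so the final loop runs once per person rather than once per CSV row.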

You can do this:

# groupby personal_id and then in apply groupby and aggregate by list.
s = df.groupby("personal_id").apply(
    lambda x: x.groupby("location_type")["location_number"].agg(list).to_dict()
)
# then construct dict from series
out = [{"personal_id": idx, **v} for idx, v in zip(s.index, s)]

print(out)

Prints:

[
    {"personal_id": "1", "branch": [456, 567], "company": [123]},
    {"personal_id": "2", "branch": [321]},
]
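
Whether the ID comes out as 1 or as the quoted "1" shown in the output above depends on the dtype of the personal_id column when the frame is built. A minimal sketch of the difference, assuming the data is read with pandas.read_csv from comma-separated text. The inline sample and the helper name build_records are illustrative, not from the original post, and the key is written as personal_id to match the desired output in the question.

```python
import io

import pandas as pd

# Illustrative sample; the real file and its separator are assumptions.
CSV_TEXT = """personal_id,location_type,location_number
1,company,123
2,branch,321
1,branch,456
1,branch,567
"""


def build_records(df):
    # Same nested-groupby idea as above: one dict of lists per personal_id.
    s = df.groupby("personal_id").apply(
        lambda x: x.groupby("location_type")["location_number"].agg(list).to_dict()
    )
    return [{"personal_id": idx, **v} for idx, v in zip(s.index, s)]


# Default type inference: personal_id is parsed as an integer column.
as_int = build_records(pd.read_csv(io.StringIO(CSV_TEXT)))

# Forcing personal_id to str keeps the IDs as strings in the output.
as_str = build_records(
    pd.read_csv(io.StringIO(CSV_TEXT), dtype={"personal_id": str})
)

print(as_int)
print(as_str)
```

If the JSON should carry numeric IDs, the default read_csv inference is likely enough; the dtype argument is only needed when IDs must stay as text, for example to preserve leading zeros.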
