I have a bunch of data saved as JSON strings in a Pandas DataFrame. I’d like to aggregate the DataFrame based on the JSON data. Here’s some sample data:
import pandas as pd

data = {
    'id': [1, 2, 3],
    'name': ['geo1', 'geo2', 'geo3'],
    'json_data': [
        '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
        '{"year": [2000, 2001, 2005], "val": [50, 60, 70]}',
        '{"year": [2000, 2001, 2002], "val": [80, 90, 85]}'
    ]
}
df = pd.DataFrame(data)
I’d like to aggregate by year and calculate the median of val. So, if year and val were regular columns, it would be something like:
dff = df.groupby(['year'], as_index=False).agg({'val':'median'})
print(dff)
year val
2000 50
2001 60
2002 58
2005 70
In case of an even number of values, round up the median. I want only integer values, no decimals.
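For example, 2002 has an even number of values (30 and 85), so the plain median is 57.5 and the result I want is 58. A tiny illustration of that rounding (math.ceil is just my shorthand for 'round up', not a requirement):

import math
import statistics

# 2002 has values 30 and 85 -> median 57.5 -> rounded up to 58
math.ceil(statistics.median([30, 85]))  # 58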
5 Answers
Extract year and val from json_data, then group by year to get the result.
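A minimal sketch of that idea, assuming the standard json module for the parsing and pd.concat to stack one small frame per row (the variable names are mine):

import json
import math
import pandas as pd

# parse each JSON string and stack the (year, val) pairs into one long frame
long_df = pd.concat(
    (pd.DataFrame(json.loads(s)) for s in df['json_data']),
    ignore_index=True,
)

# group by year, take the median of val and round it up to an integer
result = long_df.groupby('year', as_index=False)['val'].median()
result['val'] = result['val'].apply(math.ceil)
print(result)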
Output
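   year  val
0  2000   50
1  2001   60
2  2002   58
3  2005   70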
To achieve the desired aggregation and median calculation, first convert the JSON strings in the json_data column into actual Python dictionaries. Then, explode the dictionaries to create multiple rows for each year and val pair. Lastly, group by year and calculate the rounded-up median of val.
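A sketch of those three steps, assuming json.loads for the parsing and the list-of-columns form of DataFrame.explode (pandas 1.3+); the intermediate names are mine:

import json
import math
import pandas as pd

# parse the JSON strings and pull the year/val lists into their own columns
tmp = df.copy()
tmp['json_data'] = tmp['json_data'].apply(json.loads)
tmp['year'] = tmp['json_data'].apply(lambda d: d['year'])
tmp['val'] = tmp['json_data'].apply(lambda d: d['val'])

# explode the paired lists so every (year, val) combination is its own row
tmp = tmp.explode(['year', 'val']).astype({'year': int, 'val': int})

# group by year and take the rounded-up median of val
dff = tmp.groupby('year', as_index=False).agg({'val': lambda s: math.ceil(s.median())})
print(dff)

Here is the output:

   year  val
0  2000   50
1  2001   60
2  2002   58
3  2005   70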
Convert the elements in json_data from string to dictionary and append them to a DataFrame.
First, convert each JSON string to a dictionary and load the result into a pandas DataFrame:
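A possible version of this step, assuming json.loads and an index-aligned join (df2 is just my name for the intermediate frame):

import json
import pandas as pd

# parse each JSON string into a dict, then turn the year/val lists into list-columns
parsed = df['json_data'].apply(json.loads)
df2 = df.drop(columns='json_data').join(pd.DataFrame(parsed.tolist(), index=df.index))
print(df2)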
Then, explode into rows and group by year with the median aggregation:
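And a sketch of the second step, again assuming the list-of-columns form of DataFrame.explode, with np.ceil doing the rounding up:

import numpy as np

# explode the paired year/val lists into rows, then aggregate
exploded = df2.explode(['year', 'val']).astype({'year': int, 'val': int})
result = exploded.groupby('year', as_index=False)['val'].median()
result['val'] = np.ceil(result['val']).astype(int)
print(result)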
Output
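   year  val
0  2000   50
1  2001   60
2  2002   58
3  2005   70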
Using ast.literal_eval, json_normalize and explode:
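A compact sketch along those lines; the exact chaining is my guess, not the original answer's code:

import ast
import numpy as np
import pandas as pd

# parse the strings with ast.literal_eval, flatten them with json_normalize,
# then explode the paired list columns into one row per (year, val)
out = (
    pd.json_normalize(df['json_data'].map(ast.literal_eval).tolist())
      .explode(['year', 'val'])
      .astype(int)
      .groupby('year', as_index=False)['val']
      .median()
)

# round the medians up to whole integers
out['val'] = np.ceil(out['val']).astype(int)
print(out)

This should print the same four rows as the other approaches: 2000 -> 50, 2001 -> 60, 2002 -> 58, 2005 -> 70.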