
I have a bunch of data saved as JSON strings in a Pandas DataFrame. I’d like to aggregate the DataFrame based on the JSON data. Here’s some sample data:

import pandas as pd

data = {
    'id': [1, 2, 3],
    'name': ['geo1', 'geo2', 'geo3'],
    'json_data': [
        '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
        '{"year": [2000, 2001, 2005], "val": [50, 60, 70]}',
        '{"year": [2000, 2001, 2002], "val": [80, 90, 85]}'
    ]
}

df = pd.DataFrame(data)

I’d like to aggregate by year and calculate the median of val. If year and val were ordinary columns, it would be something like:

dff = df.groupby(['year'], as_index=False).agg({'val':'median'})

print(dff)

year val
2000 50
2001 60
2002 58
2005 70

If there is an even number of values, round the median up. I only want integer values, no decimals.
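
For example, the 2002 values across the rows are 30 and 85, so the plain median is 57.5 and the result I want is 58. A minimal illustration of that rounding rule with the standard library:

import math
import statistics

# Even number of values -> a .5 median; round it up to the next integer.
math.ceil(statistics.median([30, 85]))  # 58, the expected value for 2002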

5 Answers


  1. Extract year and val from json_data, then group by year to get the result.

    import pandas as pd
    import json
    import math
    data = {
        'id': [1, 2, 3],
        'name': ['geo1', 'geo2', 'geo3'],
        'json_data': [
            '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
            '{"year": [2000, 2001, 2005], "val": [50, 60, 70]}',
            '{"year": [2000, 2001, 2002], "val": [80, 90, 85]}'
        ]
    }
    df = pd.DataFrame(data)
    result = (df['json_data']
        .apply(json.loads)                                     # parse each JSON string into a dict
        .apply(lambda x: pd.Series(zip(x['year'], x['val'])))  # one (year, val) tuple per entry
        .stack()                                               # long Series of (year, val) tuples
        .apply(pd.Series)                                      # split tuples into columns 0 (year) and 1 (val)
        .groupby(0)[1]                                         # group val (column 1) by year (column 0)
        .median()
        .apply(math.ceil)                                      # round the median up to an integer
        .reset_index()
        )
    result.columns = ['year', 'val']
    print(result)
    

    Output

       year  val
    0  2000   50
    1  2001   60
    2  2002   58
    3  2005   70
    
  2. To achieve the desired aggregation and median calculation, first convert the JSON strings in the json_data column to Python dictionaries. Then explode the dictionaries into one row per year/val pair. Finally, group by year and calculate the rounded-up median of val:

    import pandas as pd
    import json
    import math
    
    data = {
        'id': [1, 2, 3],
        'name': ['geo1', 'geo2', 'geo3'],
        'json_data': [
            '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
            '{"year": [2000, 2001, 2005], "val": [50, 60, 70]}',
            '{"year": [2000, 2001, 2002], "val": [80, 90, 85]}'
        ]
    }
    
    df = pd.DataFrame(data)
    
    # Convert JSON strings to dictionaries
    df['json_data'] = df['json_data'].apply(json.loads)
    
    # Create an empty DataFrame to store the exploded data
    exploded_data = []
    
    # Iterate through rows and explode the data
    for _, row in df.iterrows():
        id_value = row['id']
        name_value = row['name']
        json_data = row['json_data']
        for year, val in zip(json_data['year'], json_data['val']):
            exploded_data.append({'id': id_value, 'name': name_value, 'year': year, 'val': val})
    
    exploded_df = pd.DataFrame(exploded_data)
    
    # Use groupby to calculate the median and round up
    dff = exploded_df.groupby('year', as_index=False)['val'].median()
    dff['val'] = dff['val'].apply(math.ceil)  # Round up median
    
    print(dff.to_string(index=False))
    

    Here is the output:

    year  val
    2000   50
    2001   60
    2002   58
    2005   70
    
  3. Convert the elements in json_data from strings to dictionaries and append them to a DataFrame:


    import json
    import numpy as np
    import pandas as pd
    
    data = {
        'id': [1, 2, 3],
        'name': ['geo1', 'geo2', 'geo3'],
        'json_data': [
            '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
            '{"year": [2000, 2001, 2005], "val": [50, 60, 70]}',
            '{"year": [2000, 2001, 2002], "val": [80, 90, 85]}'
        ]
    }
    
    # Parse each JSON string into its own DataFrame, then concatenate them
    df_list = []
    for el in data['json_data']:
        df_list.append(pd.DataFrame(json.loads(el)))
    dfs = pd.concat(df_list)
    
    # Median per year, rounded up to an integer
    dff = dfs.groupby(['year'], as_index=False).agg({'val': 'median'})
    dff.val = dff.val.apply(np.ceil).astype(int)
    print(dff)
    

    Output
       year  val
    0  2000   50
    1  2001   60
    2  2002   58
    3  2005   70
    
  4. First, convert each JSON string to a dictionary and load the result into a pandas DataFrame:

    import json
    import pandas as pd
    
    df = pd.DataFrame([json.loads(item) for item in data['json_data']])
    

    Then, explode into rows and group by year with the median aggregation:

    df.explode(['year', 'val']).groupby('year')['val'].median().round().astype(int)
    

    Output

    year
    2000    50
    2001    60
    2002    58
    2005    70
    Name: val, dtype: int64
    
  5. Using ast.literal_eval, json_normalize, explode:

    from ast import literal_eval
    
    out = (pd
           .json_normalize(df['json_data'].apply(literal_eval))
           .explode(['year', 'val'])
           .groupby('year', as_index=False)['val'].median()
           )
    
    print(out)
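
    If you also need the rounded-up integer median from the question, the val column of out can be adjusted afterwards. A small follow-on sketch, reusing the out DataFrame built above:

    import numpy as np

    # Round each median up and cast to int, matching the question's requirement
    out['val'] = out['val'].apply(np.ceil).astype(int)
    print(out)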
    