
I have a bunch of data saved as JSON strings in a Pandas DataFrame. I’d like to aggregate the DataFrame based on the JSON data. Here’s some sample data:

import pandas as pd

data = {
    'id': [1, 2, 3],
    'name': ['geo1', 'geo2', 'geo3'],
    'json_data': [
        '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
        '{"year": [2000, 2001, 2005], "val": [50, 60, 70]}',
        '{"year": [2000, 2001, 2002], "val": [80, 90, 85]}'
    ]
}

df = pd.DataFrame(data)

I’d like to aggregate by year and calculate the median of val. If year and val were ordinary columns, it would be something like:

dff = df.groupby(['year'], as_index=False).agg({'val':'median'})

print(dff)

year val
2000 50
2001 60
2002 58
2005 70

If there is an even number of values, round the median up. I only want integer values, no decimals.
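
For example, the 2002 values across the rows are 30 and 85, so the plain median is 57.5 and the result I want is 58. A minimal illustration of that rounding rule with the standard library:

import math
import statistics

# Even number of values -> a .5 median; round it up to the next integer.
math.ceil(statistics.median([30, 85]))  # 58, the expected value for 2002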

5 Answers


  1. Extract year and val from json_data, then group by year to get the result.

    import pandas as pd
    import json
    import math
    data = {
        'id': [1, 2, 3],
        'name': ['geo1', 'geo2', 'geo3'],
        'json_data': [
            '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
            '{"year": [2000, 2001, 2005], "val": [50, 60, 70]}',
            '{"year": [2000, 2001, 2002], "val": [80, 90, 85]}'
        ]
    }
    df = pd.DataFrame(data)
    result = (df['json_data']
        .apply(json.loads)                                     # parse each JSON string into a dict
        .apply(lambda x: pd.Series(zip(x['year'], x['val'])))  # one (year, val) tuple per entry
        .stack()                                               # long Series of (year, val) tuples
        .apply(pd.Series)                                      # split tuples into columns 0 (year) and 1 (val)
        .groupby(0)[1]                                         # group val (column 1) by year (column 0)
        .median()
        .apply(math.ceil)                                      # round the median up to an integer
        .reset_index()
        )
    result.columns = ['year', 'val']
    print(result)
    

    Output

       year  val
    0  2000   50
    1  2001   60
    2  2002   58
    3  2005   70
    
  2. To achieve the desired aggregation and median calculation, first convert the JSON strings in the json_data column to Python dictionaries. Then explode the dictionaries into one row per year/val pair. Finally, group by year and calculate the rounded-up median of val:

    import pandas as pd
    import json
    import math
    
    data = {
        'id': [1, 2, 3],
        'name': ['geo1', 'geo2', 'geo3'],
        'json_data': [
            '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
            '{"year": [2000, 2001, 2005], "val": [50, 60, 70]}',
            '{"year": [2000, 2001, 2002], "val": [80, 90, 85]}'
        ]
    }
    
    df = pd.DataFrame(data)
    
    # Convert JSON strings to dictionaries
    df['json_data'] = df['json_data'].apply(json.loads)
    
    # Create an empty DataFrame to store the exploded data
    exploded_data = []
    
    # Iterate through rows and explode the data
    for _, row in df.iterrows():
        id_value = row['id']
        name_value = row['name']
        json_data = row['json_data']
        for year, val in zip(json_data['year'], json_data['val']):
            exploded_data.append({'id': id_value, 'name': name_value, 'year': year, 'val': val})
    
    exploded_df = pd.DataFrame(exploded_data)
    
    # Use groupby to calculate the median and round up
    dff = exploded_df.groupby('year', as_index=False)['val'].median()
    dff['val'] = dff['val'].apply(math.ceil)  # Round up median
    
    print(dff.to_string(index=False))
    

    Here is the output:

    year  val
    2000   50
    2001   60
    2002   58
    2005   70
    
  3. Convert the elements in json_data from strings to dictionaries and append them to a DataFrame:


    import json
    import numpy as np
    import pandas as pd
    
    data = {
        'id': [1, 2, 3],
        'name': ['geo1', 'geo2', 'geo3'],
        'json_data': [
            '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
            '{"year": [2000, 2001, 2005], "val": [50, 60, 70]}',
            '{"year": [2000, 2001, 2002], "val": [80, 90, 85]}'
        ]
    }
    
    # Parse each JSON string into its own DataFrame, then concatenate them
    df_list = []
    for el in data['json_data']:
        df_list.append(pd.DataFrame(json.loads(el)))
    dfs = pd.concat(df_list)
    
    # Median per year, rounded up to an integer
    dff = dfs.groupby(['year'], as_index=False).agg({'val': 'median'})
    dff.val = dff.val.apply(np.ceil).astype(int)
    print(dff)
    

    Output
       year  val
    0  2000   50
    1  2001   60
    2  2002   58
    3  2005   70
    
  4. First, convert each JSON string to a dictionary and load the result into a pandas DataFrame:

    import json
    import pandas as pd
    
    df = pd.DataFrame([json.loads(item) for item in data['json_data']])
    

    Then, explode into rows and group by year with the median aggregation:

    df.explode(['year', 'val']).groupby('year')['val'].median().round().astype(int)
    

    Output

    year
    2000    50
    2001    60
    2002    58
    2005    70
    Name: val, dtype: int64
    
  5. Using ast.literal_eval, json_normalize, explode:

    from ast import literal_eval
    
    out = (pd
           .json_normalize(df['json_data'].apply(literal_eval))
           .explode(['year', 'val'])
           .groupby('year', as_index=False)['val'].median()
           )
    
    print(out)
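
    If you also need the rounded-up integer median from the question, the val column of out can be adjusted afterwards. A small follow-on sketch, reusing the out DataFrame built above:

    import numpy as np

    # Round each median up and cast to int, matching the question's requirement
    out['val'] = out['val'].apply(np.ceil).astype(int)
    print(out)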
    