Question posted in Json
Our archive of expertly curated questions and answers provides insights and solutions to common problems related to this popular data interchange format. From parsing and manipulating JSON data to integrating it with various programming languages and web services, our archive has got you covered. Start exploring today and take your JSON skills to the next level

Extract values from json data with condition and add them in as new columns

kms
August 26, 2023
209 views
2 votes
3 Answers

I have a Pandas DataFrame with JSON data. I’d like to extract the most recent year and the corresponding val and add them in as new column.

Sample DataFrame:

data = {
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'json_data': [
        '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
        '{"year": [2003, 2004, 2005], "val": [50, 60, 70]}',
        '{"year": [2006, 2007, 2008], "val": [80, 90, 85]}'
    ]
}

df = pd.DataFrame(data)

Expected output:

New columns:

Most Recent Year
Most Recent val

For row with id 1, this would the year 2002 and val 30.

Answers

One way to do this is to use a function to figure out the latest values from each JSON string and apply that to the dataframe. In your data it looks like the most recent values are the last ones, in which case you could simply use:

def most_recent_values(row):
    dd = json.loads(row['json_data'])
    return (dd['year'][-1], dd['val'][-1])

If that’s not the case, you’ll have to find the index of the maximum value of the (year, val) tuple and use that:

def most_recent_values(row):
    dd = json.loads(row['json_data'])
    return max(list(zip(dd['year'], dd['val'])))

Then you can add the two new columns with apply

df[['Most Recent Year', 'Most Recent Val']] = df.apply(most_recent_values, axis=1, result_type='expand')

For your sample data, the result of both functions is the same:

   id     name                                          json_data  Most Recent Year  Most Recent Val
0   1    Alice  {"year": [2000, 2001, 2002], "val": [10, 20, 30]}              2002               30
1   2      Bob  {"year": [2003, 2004, 2005], "val": [50, 60, 70]}              2005               70
2   3  Charlie  {"year": [2006, 2007, 2008], "val": [80, 90, 85]}              2008               85

Using map, zip, and max, the solution is straightforward.

df[['Most Recent Year', 'Most Recent val']] = (
    df['json_data']
    .map(json.loads)
    .map(lambda d: list(zip(d['year'], d['val'])))
    .map(lambda tup: list(max(tup)))
    .tolist()
)

Result:

   id     name                                          json_data  Most Recent Year  Most Recent val
0   1    Alice  {"year": [2000, 2001, 2002], "val": [10, 20, 30]}              2002               30
1   2      Bob  {"year": [2003, 2004, 2005], "val": [50, 60, 70]}              2005               70
2   3  Charlie  {"year": [2006, 2007, 2008], "val": [80, 90, 85]}              2008               85

To understand what’s going on, break it into steps:

tuple_series = (df['json_data']
    .map(json.loads)
    .map(lambda d: list(zip(d['year'], d['val']))
)

That’s a Series in which each value is a list-of-tuples:

0    [(2000, 10), (2001, 20), (2002, 30)]
1    [(2003, 50), (2004, 60), (2005, 70)]
2    [(2006, 80), (2007, 90), (2008, 85)]

Then find the max tuple in each list, and convert the whole result to lists-of-lists so the column assignment will work correctly.

(Note that in Python, tuples can be compared to each other, so max will work as expected. It compares the first items, then the second items only if necessary.)

tuple_series.map(lambda tup: list(max(tup))).tolist()

Result:

[[2002, 30], [2005, 70], [2008, 85]]

You can write a custom function to keep the most recent year then use pd.concat to get expected result:

import json

def most_recent(x):
    df = pd.DataFrame(json.loads(x)).set_axis(['Most Recent Year',  'Most Recent val'], axis=1)
    return df.loc[df[['Most Recent Year']].idxmax()]

df1 = pd.concat([*df.pop('json_data').agg(most_recent)]).set_index(df.index)
out = pd.concat([df, df1], axis=1)

Output:

>>> out
   id     name  Most Recent Year  Most Recent val
0   1    Alice              2002               30
1   2      Bob              2005               70
2   3  Charlie              2008               85

Please signup or login to give your own answer.

Click here to cancel reply.