I have a Pandas DataFrame with JSON data. I’d like to extract the most recent year and the corresponding val and add them as new columns.
Sample DataFrame:
import pandas as pd

data = {
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'json_data': [
        '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
        '{"year": [2003, 2004, 2005], "val": [50, 60, 70]}',
        '{"year": [2006, 2007, 2008], "val": [80, 90, 85]}'
    ]
}
df = pd.DataFrame(data)
Expected output:
New columns: Most Recent Year, Most Recent val
For the row with id 1, this would be the year 2002 and val 30.
Answers
One way to do this is to use a function to figure out the latest values from each JSON string and apply that to the dataframe. In your data it looks like the most recent values are the last ones, in which case you could simply use:
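A minimal sketch of such a function, assuming the lists in each JSON string are already in chronological order. The name `get_latest` is an illustrative choice:

```python
import json

def get_latest(jstr):
    """Return the last (year, val) pair, assuming years are in ascending order."""
    d = json.loads(jstr)
    return d['year'][-1], d['val'][-1]

latest = get_latest('{"year": [2000, 2001, 2002], "val": [10, 20, 30]}')
print(latest)  # (2002, 30)
```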
If that’s not the case, you’ll have to find the index of the maximum value of the (year, val) tuple and use that.
Then you can add the two new columns with apply:
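Both steps might look like the sketch below. It assumes "maximum" means the largest year, and `get_latest_max` is a hypothetical name:

```python
import json
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'json_data': [
        '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
        '{"year": [2003, 2004, 2005], "val": [50, 60, 70]}',
        '{"year": [2006, 2007, 2008], "val": [80, 90, 85]}',
    ],
})

def get_latest_max(jstr):
    """Return the (year, val) pair at the index of the largest year."""
    d = json.loads(jstr)
    years = d['year']
    i = years.index(max(years))  # position of the most recent year
    return years[i], d['val'][i]

# apply yields a Series of tuples; tolist() turns it into rows
# that can be assigned to two new columns at once.
df[['Most Recent Year', 'Most Recent val']] = (
    df['json_data'].apply(get_latest_max).tolist()
)
```

Unlike taking the last element, this still works when the years are not sorted.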
For your sample data, the result of both functions is the same:
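A quick check of that claim, as a sketch with both helpers redefined under illustrative names so the block runs on its own:

```python
import json
import pandas as pd

json_data = pd.Series([
    '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
    '{"year": [2003, 2004, 2005], "val": [50, 60, 70]}',
    '{"year": [2006, 2007, 2008], "val": [80, 90, 85]}',
])

def get_latest(jstr):
    # Assumes the last entry is the most recent one.
    d = json.loads(jstr)
    return d['year'][-1], d['val'][-1]

def get_latest_max(jstr):
    # Looks up the entry at the index of the largest year.
    d = json.loads(jstr)
    i = d['year'].index(max(d['year']))
    return d['year'][i], d['val'][i]

by_position = json_data.apply(get_latest).tolist()
by_maximum = json_data.apply(get_latest_max).tolist()

print(by_position)                # [(2002, 30), (2005, 70), (2008, 85)]
print(by_position == by_maximum)  # True
```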
Using map, zip, and max, the solution is straightforward.
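One way this could look, as a sketch rather than the exact original one-liner. It assumes each string is parsed with `json.loads` first:

```python
import json
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'json_data': [
        '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
        '{"year": [2003, 2004, 2005], "val": [50, 60, 70]}',
        '{"year": [2006, 2007, 2008], "val": [80, 90, 85]}',
    ],
})

df[['Most Recent Year', 'Most Recent val']] = (
    df['json_data']
    .map(json.loads)                                # str -> dict
    .map(lambda d: max(zip(d['year'], d['val'])))   # largest (year, val) tuple
    .map(list)                                      # tuple -> list
    .to_list()                                      # Series -> list of lists
)

print(df[['Most Recent Year', 'Most Recent val']].values.tolist())
# [[2002, 30], [2005, 70], [2008, 85]]
```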
To understand what’s going on, break it into steps:
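The first step can be sketched like this: parse each string, then zip the years with the values. The variable names are illustrative:

```python
import json
import pandas as pd

json_data = pd.Series([
    '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
    '{"year": [2003, 2004, 2005], "val": [50, 60, 70]}',
    '{"year": [2006, 2007, 2008], "val": [80, 90, 85]}',
], name='json_data')

# Pair up each year with its val, one list of tuples per row.
pairs = json_data.map(json.loads).map(lambda d: list(zip(d['year'], d['val'])))

print(pairs.to_list())
# [[(2000, 10), (2001, 20), (2002, 30)],
#  [(2003, 50), (2004, 60), (2005, 70)],
#  [(2006, 80), (2007, 90), (2008, 85)]]
```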
That’s a Series in which each value is a list-of-tuples.
Then find the max tuple in each list, and convert the whole result to lists-of-lists so the column assignment will work correctly.
(Note that in Python, tuples can be compared to each other, so max will work as expected. It compares the first items, then the second items only if necessary.)
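A sketch of this second step, rebuilding the Series of tuples first so the block runs on its own. The variable names are illustrative:

```python
import json
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'json_data': [
        '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
        '{"year": [2003, 2004, 2005], "val": [50, 60, 70]}',
        '{"year": [2006, 2007, 2008], "val": [80, 90, 85]}',
    ],
})
pairs = df['json_data'].map(json.loads).map(lambda d: list(zip(d['year'], d['val'])))

# Tuples compare item by item: first items, then second items on a tie.
print(max([(2002, 30), (2001, 99)]))  # (2002, 30)
print(max([(2002, 30), (2002, 31)]))  # (2002, 31)

# Largest tuple per row, converted to a list of lists for assignment.
latest = pairs.map(max).map(list).to_list()
print(latest)  # [[2002, 30], [2005, 70], [2008, 85]]

df[['Most Recent Year', 'Most Recent val']] = latest
```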
You can write a custom function to keep the most recent year, then use pd.concat to get the expected result:
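A sketch of that approach. The helper name `most_recent` is hypothetical, and it assumes the most recent entry is the one with the largest year:

```python
import json
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'json_data': [
        '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
        '{"year": [2003, 2004, 2005], "val": [50, 60, 70]}',
        '{"year": [2006, 2007, 2008], "val": [80, 90, 85]}',
    ],
})

def most_recent(jstr):
    """Return the most recent year and its val as a labelled Series."""
    d = json.loads(jstr)
    year, val = max(zip(d['year'], d['val']))
    return pd.Series({'Most Recent Year': year, 'Most Recent val': val})

# Series.apply with a function that returns a Series yields a DataFrame,
# which is then placed next to the original columns.
out = pd.concat([df, df['json_data'].apply(most_recent)], axis=1)

print(out[['id', 'Most Recent Year', 'Most Recent val']].values.tolist())
# [[1, 2002, 30], [2, 2005, 70], [3, 2008, 85]]
```

Unlike column assignment, this leaves df itself unchanged and returns a new DataFrame.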