
I have a Pandas DataFrame with JSON data. I’d like to extract the most recent year and the corresponding val and add them as new columns.

Sample DataFrame:

import pandas as pd

data = {
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'json_data': [
        '{"year": [2000, 2001, 2002], "val": [10, 20, 30]}',
        '{"year": [2003, 2004, 2005], "val": [50, 60, 70]}',
        '{"year": [2006, 2007, 2008], "val": [80, 90, 85]}'
    ]
}

df = pd.DataFrame(data)

Expected output:

New columns:

Most Recent Year
Most Recent val

For the row with id 1, this would be the year 2002 and val 30.

3 Answers


  1. One way to do this is to use a function to figure out the latest values from each JSON string and apply that to the dataframe. In your data it looks like the most recent values are the last ones, in which case you could simply use:

    import json

    def most_recent_values(row):
        # parse the JSON string and take the last year/val pair
        dd = json.loads(row['json_data'])
        return (dd['year'][-1], dd['val'][-1])
    

    If that’s not the case, you’ll have to take the maximum (year, val) tuple instead, which picks the entry with the latest year:

    def most_recent_values(row):
        # pair each year with its value and take the tuple with the largest year
        dd = json.loads(row['json_data'])
        return max(zip(dd['year'], dd['val']))
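
    For example, calling either version on the first row of the sample data should return the same tuple:

    most_recent_values(df.iloc[0])
    # (2002, 30)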
    

    Then you can add the two new columns with apply:

    df[['Most Recent Year', 'Most Recent Val']] = df.apply(most_recent_values, axis=1, result_type='expand')
    

    For your sample data, the result of both functions is the same:

       id     name                                          json_data  Most Recent Year  Most Recent Val
    0   1    Alice  {"year": [2000, 2001, 2002], "val": [10, 20, 30]}              2002               30
    1   2      Bob  {"year": [2003, 2004, 2005], "val": [50, 60, 70]}              2005               70
    2   3  Charlie  {"year": [2006, 2007, 2008], "val": [80, 90, 85]}              2008               85
    
  2. Using map, zip, and max, the solution is straightforward.

    import json

    df[['Most Recent Year', 'Most Recent val']] = (
        df['json_data']
        .map(json.loads)
        .map(lambda d: list(zip(d['year'], d['val'])))
        .map(lambda tup: list(max(tup)))
        .tolist()
    )
    

    Result:

       id     name                                          json_data  Most Recent Year  Most Recent val
    0   1    Alice  {"year": [2000, 2001, 2002], "val": [10, 20, 30]}              2002               30
    1   2      Bob  {"year": [2003, 2004, 2005], "val": [50, 60, 70]}              2005               70
    2   3  Charlie  {"year": [2006, 2007, 2008], "val": [80, 90, 85]}              2008               85
    

    To understand what’s going on, break it into steps:

    tuple_series = (df['json_data']
        .map(json.loads)
        .map(lambda d: list(zip(d['year'], d['val'])))
    )
    

    That’s a Series in which each value is a list-of-tuples:

    0    [(2000, 10), (2001, 20), (2002, 30)]
    1    [(2003, 50), (2004, 60), (2005, 70)]
    2    [(2006, 80), (2007, 90), (2008, 85)]
    

    Then find the max tuple in each list and convert the whole result to a list of lists so that the two-column assignment works correctly.

    (Note that in Python, tuples can be compared to each other, so max will work as expected. It compares the first items, then the second items only if necessary.)
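
    A quick check with the values from the first row shows this ordering:

    max([(2000, 10), (2001, 20), (2002, 30)])
    # (2002, 30)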

    tuple_series.map(lambda tup: list(max(tup))).tolist()
    

    Result:

    [[2002, 30], [2005, 70], [2008, 85]]
    
  3. You can write a custom function to keep the most recent year, then use pd.concat to get the expected result:

    import json
    
    def most_recent(x):
        # build a two-column frame from the JSON string, then keep the row with the maximum year
        df = pd.DataFrame(json.loads(x)).set_axis(['Most Recent Year', 'Most Recent val'], axis=1)
        return df.loc[df[['Most Recent Year']].idxmax()]

    df1 = pd.concat([*df.pop('json_data').agg(most_recent)]).set_index(df.index)
    out = pd.concat([df, df1], axis=1)
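
    For example, applied to the first JSON string from the sample data, most_recent should return a one-row DataFrame whose index (2) is the position of the maximum year:

    most_recent(data['json_data'][0])
    #    Most Recent Year  Most Recent val
    # 2              2002               30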
    

    Output:

    >>> out
       id     name  Most Recent Year  Most Recent val
    0   1    Alice              2002               30
    1   2      Bob              2005               70
    2   3  Charlie              2008               85
    