Python Pandas extract csv column containing json

ukasz
July 18, 2024
67 views
3 votes
3 Answers

I want to learn Pandas framework, so I find free csv with Euro data from kaggle.com
https://www.kaggle.com/datasets/piterfm/football-soccer-uefa-euro-1960-2024/data
But there’re plenty of columns which looks like this

subset['goals'][1]

"[{'phase': 'FIRST_HALF', 'time': {'minute': 7, 'second': 41}, 'international_name': 'Xavi Simons', 'club_shirt_name': 'Xavi', 'country_code': 'NED', 'national_field_position': 'FORWARD', 'national_jersey_number': '7', 'goal_type': 'SCORED'}, {'phase': 'FIRST_HALF', 'time': {'minute': 18, 'second': 34}, 'international_name': 'Harry Kane', 'club_shirt_name': 'Kane', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '9', 'goal_type': 'PENALTY'}, {'phase': 'SECOND_HALF', 'time': {'injuryMinute': 1, 'minute': 90, 'second': 1}, 'international_name': 'Ollie Watkins', 'club_shirt_name': 'Watkins', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '19', 'goal_type': 'SCORED'}]"

So I’d like to extract this data and manipulate on it
example dataframe

I’ve traied to use this code

import json
stdf = subset['goals'].apply(json.loads)
# stlst = list(stdf)
# stjson = json.dumps(stlst)
# subset.join(pandas.read_json(stjson))

But for stdf = subset['goals'].apply(json.loads) I’m getting the error message
the JSON object must be str, bytes or bytearray, not float

So, I don’t know how to solve this problem.

I guess I have to iterate over the goal column, I’ve tried something, but still the results were not what they should have been.

Answers

Chosen as BEST ANSWER

@Daweo - ok it's helpful but It's not working with NaN values. I've tried iterate over every row but it returned me ValueError.

I will show it in a different perspective.

this is my dataframe

data = [[2036211,'Spain','England','','',''], [2036210,'Netherlands','England',1.0,2.0,"[{'phase': 'FIRST_HALF', 'time': {'minute': 7, 'second': 41}, 'international_name': 'Xavi Simons', 'club_shirt_name': 'Xavi', 'country_code': 'NED', 'national_field_position': 'FORWARD', 'national_jersey_number': '7', 'goal_type': 'SCORED'}, {'phase': 'FIRST_HALF', 'time': {'minute': 18, 'second': 34}, 'international_name': 'Harry Kane', 'club_shirt_name': 'Kane', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '9', 'goal_type': 'PENALTY'}, {'phase': 'SECOND_HALF', 'time': {'injuryMinute': 1, 'minute': 90, 'second': 1}, 'international_name': 'Ollie Watkins', 'club_shirt_name': 'Watkins', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '19', 'goal_type': 'SCORED'}]"]]

df = pd.DataFrame(data, columns=['id_match' ,'home_team' ,'away_team' ,'home_score','away_score','goals'])

How to extract goals column to have dataframe with column id_match, home_team, away_team, home_score, away_score, goals_phase, goals_time etc.

If there are more than one goal, It returns many rows with the same id_match, home_team, away_team, home_score, away_score columns and unique from goals column.

(Edit)

This

"[{'phase': 'FIRST_HALF', 'time': {'minute': 7, 'second': 41}, 'international_name': 'Xavi Simons', 'club_shirt_name': 'Xavi', 'country_code': 'NED', 'national_field_position': 'FORWARD', 'national_jersey_number': '7', 'goal_type': 'SCORED'}, {'phase': 'FIRST_HALF', 'time': {'minute': 18, 'second': 34}, 'international_name': 'Harry Kane', 'club_shirt_name': 'Kane', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '9', 'goal_type': 'PENALTY'}, {'phase': 'SECOND_HALF', 'time': {'injuryMinute': 1, 'minute': 90, 'second': 1}, 'international_name': 'Ollie Watkins', 'club_shirt_name': 'Watkins', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '19', 'goal_type': 'SCORED'}]"

is not JSON, but is valid python list and thus can be loaded using ast.literal_eval that is

import ast
import pandas as pd
data_string = "[{'phase': 'FIRST_HALF', 'time': {'minute': 7, 'second': 41}, 'international_name': 'Xavi Simons', 'club_shirt_name': 'Xavi', 'country_code': 'NED', 'national_field_position': 'FORWARD', 'national_jersey_number': '7', 'goal_type': 'SCORED'}, {'phase': 'FIRST_HALF', 'time': {'minute': 18, 'second': 34}, 'international_name': 'Harry Kane', 'club_shirt_name': 'Kane', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '9', 'goal_type': 'PENALTY'}, {'phase': 'SECOND_HALF', 'time': {'injuryMinute': 1, 'minute': 90, 'second': 1}, 'international_name': 'Ollie Watkins', 'club_shirt_name': 'Watkins', 'country_code': 'ENG', 'national_field_position': 'FORWARD', 'national_jersey_number': '19', 'goal_type': 'SCORED'}]"
data = ast.literal_eval(data_string)
df = pd.DataFrame(data)
print(df)

gives output

         phase  ... goal_type
0   FIRST_HALF  ...    SCORED
1   FIRST_HALF  ...   PENALTY
2  SECOND_HALF  ...    SCORED

[3 rows x 8 columns]

So what you’ll need to do is first convert those string values into a dictionary. Then use explode to convert each if those values of the keys into rows. Then ultimately use json_normalize

import pandas as pd
import ast

file = 'C:/Users/SchvaJ01/Downloads/2024.csv/2024.csv'
df = pd.read_csv(file)


# Convert string representations of lists to actual lists
df['goals'] = df['goals'].fillna('[]').apply(ast.literal_eval)

# Explode the list of dictionaries into separate rows
df_exploded = df.explode('goals')

# Normalize the nested dictionaries
df_flattened = pd.json_normalize(df_exploded['goals'])


print(df_flattened.head(5).to_string())

Output:

print(df_flattened.head(5).to_string())
         phase international_name   club_shirt_name country_code national_field_position national_jersey_number goal_type  time.minute  time.second  time.injuryMinute
0  SECOND_HALF      Nico Williams  WILLIAMS ARTHUER          ESP                 FORWARD                     17    SCORED         47.0         10.0                NaN
1  SECOND_HALF        Cole Palmer            Palmer          ENG              MIDFIELDER                     24    SCORED         73.0          9.0                NaN
2  SECOND_HALF    Mikel Oyarzabal         Oyarzabal          ESP                 FORWARD                     21    SCORED         86.0         56.0                NaN
3   FIRST_HALF        Xavi Simons              Xavi          NED                 FORWARD                      7    SCORED          7.0         41.0                NaN
4   FIRST_HALF         Harry Kane              Kane          ENG                 FORWARD                      9   PENALTY         18.0         34.0                NaN

Please signup or login to give your own answer.

Click here to cancel reply.