I have a reproducible example, toy dataframe:
df = pd.DataFrame({'my_customers':['John','Foo'],'email':['[email protected]','[email protected]'],'other_column':['yes','no']})
print(df)
  my_customers             email other_column
0         John  [email protected]          yes
1          Foo  [email protected]           no
And I apply() a function to the rows, creating a new column inside the function:
def func(row):
    # if this column is 'yes'
    if row['other_column'] == 'yes':
        # create a new column with 'Hello' in it
        row['new_column'] = 'Hello'
        # return to df
        return row
    # otherwise
    else:
        # just return the row
        return row
I then apply the function to the df, and we can see that the order has been changed. The columns are now in alphabetical order. Is there any way to avoid this? I would like to keep it in the original order.
df = df.apply(func, axis = 1)
print(df)
               email my_customers new_column other_column
0  [email protected]         John      Hello          yes
1  [email protected]          Foo        NaN           no
Edited for clarification – the above code was too simple
input
df = pd.DataFrame({'my_customers':['John','Foo'],
                   'email':['[email protected]','[email protected]'],
                   'api_status':['data found','no data found'],
                   'api_response':['huge json','huge json']})
  my_customers             email     api_status api_response
0         John  [email protected]     data found    huge json
1          Foo  [email protected]  no data found    huge json
Parsing the api_response. I need to create many new columns in the DF:
def api_parse(row):
    # if we have response data ('huge json' is the placeholder used in the input above)
    if row['api_response'] == 'huge json':
        # get response for parsing
        response_data = row['api_response']
        """Let's get associated URLS first"""
        # start with an empty list so the social-profile filters below always have something to scan
        urls = []
        # if there's a URL section in the response
        if 'urls' in response_data.keys():
            # get all associated URLS into a list
            urls = extract_values(response_data['urls'], 'url')
            row['Associated_Urls'] = urls
        """Get a list of jobs"""
        if 'jobs' in response_data.keys():
            # get all associated jobs and organizations into a list
            titles = extract_values(response_data['jobs'], 'title')
            organizations = extract_values(response_data['jobs'], 'organization')
            counter = 1
            # create a new column for each job
            for pair in zip(titles, organizations):
                row['Job'+'_'+str(counter)] = f'Title: {pair[0]}, Organization: {pair[1]}'
                counter += 1
        """Get a list of education"""
        if 'educations' in response_data.keys():
            # get all degrees into list
            degrees = extract_values(response_data['educations'], 'display')
            counter = 1
            # create a new column for each degree
            for edu in degrees:
                row['education'+'_'+str(counter)] = edu
                counter += 1
        """Get a list of social profiles from URLS we parsed earlier"""
        facebook = [i for i in urls if 'facebook' in i] or [np.nan]
        instagram = [i for i in urls if 'instagram' in i] or [np.nan]
        linkedin = [i for i in urls if 'linkedin' in i] or [np.nan]
        twitter = [i for i in urls if 'twitter' in i] or [np.nan]
        amazon = [i for i in urls if 'amazon' in i] or [np.nan]
        row['facebook'] = facebook
        row['instagram'] = instagram
        row['linkedin'] = linkedin
        row['twitter'] = twitter
        row['amazon'] = amazon
        return row
    elif row['api_status'] == 'no data found':
        # do nothing
        return row
expected output:
  my_customers             email     api_status api_response job_1 job_2
0         John  [email protected]     data found    huge json   xyz  xyz2
1          Foo  [email protected]  no data found    huge json   nan   nan

  education_1  facebook other api info
0         foo  profile1            etc
1         nan       nan            nan
Answers
You could adjust the order of columns in your DataFrame after running the apply function, for example by selecting them again in the order you want. To reduce the amount of duplication (i.e. having to retype all column names), you could get the existing list of columns before calling the apply function and reuse it afterwards.
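A minimal sketch of that idea on the toy frame from the question. The e-mail values are placeholders, and the row.copy() call is an added precaution (not in the question's function) so the row that apply passes in is not modified in place:

```python
import pandas as pd

df = pd.DataFrame({
    'my_customers': ['John', 'Foo'],
    'email': ['john@example.com', 'foo@example.com'],  # placeholder addresses
    'other_column': ['yes', 'no'],
})

def func(row):
    # work on a copy so the row handed over by apply is left untouched (assumption, see above)
    row = row.copy()
    if row['other_column'] == 'yes':
        row['new_column'] = 'Hello'
    return row

# remember the original column order before apply gets a chance to change it
existing_columns = list(df.columns)

result = df.apply(func, axis=1)

# select the columns again: original ones first, then the new one
result = result[existing_columns + ['new_column']]
print(result)
```

Whatever order apply returns the columns in, the final selection puts the original columns first again.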
Update based on the author’s edits to the original question. Whilst I’m not sure if the data structure chosen (storing API results in a DataFrame) is the best option, one simple solution could be to extract the new columns after calling the apply function.
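A sketch of how that could look when the new column names are not known in advance. Here add_parsed_columns is a hypothetical stand-in for the question's api_parse, the e-mail values are placeholders, and the last line shows the same selection written with pandas.DataFrame.reindex:

```python
import pandas as pd

df = pd.DataFrame({
    'my_customers': ['John', 'Foo'],
    'email': ['john@example.com', 'foo@example.com'],  # placeholder addresses
    'api_status': ['data found', 'no data found'],
})

def add_parsed_columns(row):
    # hypothetical stand-in for api_parse: adds columns only when data was found
    row = row.copy()
    if row['api_status'] == 'data found':
        row['job_1'] = 'xyz'
        row['education_1'] = 'foo'
    return row

# original column order, captured before apply
existing_columns = list(df.columns)

parsed = df.apply(add_parsed_columns, axis=1)

# anything that was not there before apply is a new column
new_columns = [c for c in parsed.columns if c not in existing_columns]

# original columns first, new columns afterwards
ordered = parsed[existing_columns + new_columns]

# the same ordering expressed with reindex
ordered_via_reindex = parsed.reindex(columns=existing_columns + new_columns)
```

The new columns keep whatever relative order apply produced; only the original columns are pinned to the front.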
For performance optimisations, you could store the existing columns in a set instead of a list, which yields lookups in constant time due to the hashed nature of the set data structure in Python. This would change existing_columns = list(df.columns) to existing_columns = set(df.columns). Note that a set has no order, so it only helps with the membership test; you still need the original list to put the columns back in order.
Finally, as @Parfait very kindly points out in their comment, the code above may raise some deprecation warnings. Using pandas.DataFrame.reindex instead of df = df[columns + new_columns] will make the warnings disappear.

That occurs because you don’t assign a value to the new column if row["other_column"] != 'yes'. Just assign a value in the else branch as well. You can choose the value of row["new_column"] when row["other_column"] == 'no' to be whatever. I just left it blank.
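A sketch of what that could look like, assuming an empty string as the blank value. The e-mail values are placeholders and the row.copy() call is an added precaution that is not part of the question's function:

```python
import pandas as pd

df = pd.DataFrame({
    'my_customers': ['John', 'Foo'],
    'email': ['john@example.com', 'foo@example.com'],  # placeholder addresses
    'other_column': ['yes', 'no'],
})

def func(row):
    # copy so the row passed in by apply is not modified in place
    row = row.copy()
    if row['other_column'] == 'yes':
        row['new_column'] = 'Hello'
    else:
        # assign a value here too, so every returned row has the same labels
        row['new_column'] = ''
    return row

result = df.apply(func, axis=1)
print(result)
```

Because every row now comes back with the same labels, pandas has no differing row indexes to merge, which is the step where the reordering happened in the question.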