
I have a reproducible example, toy dataframe:

import pandas as pd

df = pd.DataFrame({'my_customers':['John','Foo'],'email':['[email protected]','[email protected]'],'other_column':['yes','no']})

print(df)

  my_customers                email other_column
0         John      [email protected]          yes
1          Foo  [email protected]           no

And I apply() a function to the rows, creating a new column inside the function:

def func(row):

    # if this column is 'yes'
    if row['other_column'] == 'yes':

        # create a new column with 'Hello' in it        
        row['new_column'] = 'Hello' 

        # return to df
        return row 

    # otherwise
    else: 

        # just return the row
        return row

I then apply the function to the df, and we can see that the order has been changed. The columns are now in alphabetical order. Is there any way to avoid this? I would like to keep it in the original order.

df = df.apply(func, axis = 1)
print(df)

                 email my_customers new_column other_column
0      [email protected]         John      Hello          yes
1  [email protected]          Foo        NaN           no

Edited for clarification – the above code was too simple

input

df = pd.DataFrame({'my_customers':['John','Foo'],
                   'email':['[email protected]','[email protected]'],
                   'api_status':['data found','no data found'],
                   'api_response':['huge json','huge json']})

  my_customers                email     api_status api_response
0         John      [email protected]     data found    huge json
1          Foo  [email protected]  no data found    huge json

I parse the api_response and need to create many new columns in the DataFrame:

def api_parse(row):

    # if we have response data

    if row['api_response'] == 'huge json':  # 'huge json' is a placeholder for the real payload

        # get response for parsing

        response_data = row['api_response']

        """Let's get associated URLS first"""

        # if there's a URL section in the response

        if 'urls' in response_data.keys():

            # get all associated URLS into a list

            urls = extract_values(response_data['urls'], 'url')

            row['Associated_Urls'] = urls


        """Get a list of jobs"""

        if 'jobs' in response_data.keys():

            # get all associated jobs and organizations into a list

            titles = extract_values(response_data['jobs'], 'title')
            organizations = extract_values(response_data['jobs'], 'organization')

            counter = 1

            # create a new column for each job

            for pair in zip(titles,organizations):

                row['Job'+'_'+str(counter)] = f'Title: {pair[0]}, Organization: {pair[1]}'

                counter +=1


        """Get a list of education"""

        if 'educations' in response_data.keys():

            # get all degrees into list

            degrees = extract_values(response_data['educations'], 'display')

            counter = 1

            # create a new column for each degree

            for edu in degrees:

                row['education'+'_'+str(counter)] = edu

                counter +=1


        """Get a list of social profiles from URLS we parsed earlier"""

        facebook = [i for i in urls if 'facebook' in i] or [np.nan]
        instagram = [i for i in urls if 'instagram' in i] or [np.nan]
        linkedin = [i for i in urls if 'linkedin' in i] or [np.nan]
        twitter = [i for i in urls if 'twitter' in i] or [np.nan]
        amazon = [i for i in urls if 'amazon' in i] or [np.nan]

        row['facebook'] = facebook
        row['instagram'] = instagram
        row['linkedin'] = linkedin
        row['twitter'] = twitter
        row['amazon'] = amazon

        return row 

    elif row['api_status'] == 'no data found':
        # do nothing
        return row

expected output:

  my_customers                email     api_status api_response job_1 job_2  
0         John      [email protected]     data found    huge json   xyz  xyz2   
1          Foo  [email protected]  no data found    huge json   nan  nan

  education_1  facebook other api info  
0         foo  profile1            etc  
1         nan  nan                 nan

2 Answers


  1. You could adjust the order of columns in your DataFrame after running the apply function. For example:

    df = df.apply(func, axis = 1)
    df = df[['my_customers', 'email', 'other_column', 'new_column']]
    

    To reduce duplication (i.e. to avoid retyping every column name), you could capture the existing columns before calling apply:

    columns = list(df.columns)
    df = df.apply(func, axis = 1)
    df = df[columns + ['new_column']]
    

    Update based on the author’s edits to the original question. Whilst I’m not sure that the chosen data structure (storing API results in a DataFrame) is the best option, one simple solution is to work out which columns are new after calling apply.

    # Store the existing columns before calling apply
    existing_columns = list(df.columns)
    
    df = df.apply(func, axis = 1)
    
    all_columns = list(df.columns)
    new_columns = [column for column in all_columns if column not in existing_columns]
    
    df = df[existing_columns + new_columns]
    

    For performance, you could run the membership test against a set instead of a list, since set lookups take constant time on average. Keep the list as well, because a set does not preserve the column order and cannot be concatenated with new_columns. For example, build existing_set = set(existing_columns) and test column not in existing_set when computing new_columns.


    Finally, as @Parfait very kindly points out in their comment, the code above may raise some deprecation warnings. Using pandas.DataFrame.reindex instead of indexing the DataFrame with the list of columns will make the warnings disappear:

    new_columns_order = existing_columns + new_columns
    df = df.reindex(columns=new_columns_order)
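Putting the steps of this answer together, a minimal sketch might look like the following. It uses a trimmed version of the toy frame from the question (the email column is left out for brevity), and existing_set and result are names introduced here for illustration only. The function also copies the row before modifying it, which is a defensive choice of this sketch rather than something the question's code does.

```python
import pandas as pd

df = pd.DataFrame({
    "my_customers": ["John", "Foo"],
    "other_column": ["yes", "no"],
})


def func(row):
    row = row.copy()  # work on a copy rather than the object apply() passes in
    if row["other_column"] == "yes":
        row["new_column"] = "Hello"
    return row


# Keep a list for the ordering and a set for fast membership tests.
existing_columns = list(df.columns)
existing_set = set(existing_columns)

result = df.apply(func, axis=1)

# Columns created inside func, in whatever order apply() returned them.
new_columns = [c for c in result.columns if c not in existing_set]

# Original columns first, then the new ones.
result = result.reindex(columns=existing_columns + new_columns)
print(list(result.columns))  # ['my_customers', 'other_column', 'new_column']
```

The row that never received new_column ends up with NaN in that column, as in the question's output.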
    
  2. This happens because the function does not assign a value to the new column when row["other_column"] != 'yes', so the rows come back with different sets of labels. Try assigning it in both branches:

    def func(row):
    
        if row['other_column'] == 'yes':
    
            row['new_column'] = 'Hello' 
            return row 
    
        else: 
    
            row['new_column'] = '' 
            return row
    
    df.apply(func, axis = 1)
    

    You can use whatever value you like for new_column in the rows where other_column is 'no'. I used an empty string.
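When the new column names are not known in advance, as in the edited question, listing a default for each one by hand is impractical. One possible variation, sketched below under that assumption, is to have the function return only the new values as a dict and then join them onto the original frame, which leaves the existing columns in place. The helper new_fields and the names records, extra and result are hypothetical and do not appear in the original code; the toy frame is trimmed for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    "my_customers": ["John", "Foo"],
    "other_column": ["yes", "no"],
})


def new_fields(row):
    # Return only the values to add; rows with nothing to add return {}.
    if row["other_column"] == "yes":
        return {"new_column": "Hello"}
    return {}


# One dict per row; keys missing from a row become NaN in the new frame.
records = [new_fields(row) for _, row in df.iterrows()]
extra = pd.DataFrame(records, index=df.index)

# join() appends the new columns to the right of the existing ones.
result = df.join(extra)
print(list(result.columns))  # ['my_customers', 'other_column', 'new_column']
```

Because the original frame is never rebuilt from the returned rows, its column order is not affected by which rows happen to gain extra fields.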
