skip to Main Content

I have a dataframe which has a structure like below.
id, name, value, employee_0_salary, employee_0_skills_0_id, employee_0_skills_1_id, employee_1_salary, employee_1_skills_0_id so on..

Basically formed from a flattened json. But the number of columns exceed 1024. I couldn’t copy the structure of json here but its a nested json with lists, arrays and dictionaries.

Now I want to split the dataframe so that the id column has duplicates but empoyee details comes as columns.

id salary skills_id

How to do it in pandas?

TIA

2

Answers


  1. try this

    import pandas as pd
    
    df = pd.DataFrame({
        'id': [1, 2],
        'employee_0_salary': [30000, 35000],
        'employee_0_skills_0_id': [101, 102],
        'employee_0_skills_1_id': [103, 104],
        'employee_1_salary': [32000, 36000],
        'employee_1_skills_0_id': [105, 106],
        # Add other columns...
    })
    
    
    reshaped_data = []
    
    for employee in range(2):  # Update this based on the actual number of employees
        for skill in range(2):  # Update this based on the actual number of skills per employee
            salary_col = f'employee_{employee}_salary'
            skill_col = f'employee_{employee}_skills_{skill}_id'
    
            temp_df = df[['id', salary_col, skill_col]].copy()
    
            temp_df.rename(columns={salary_col: 'salary', skill_col: 'skills_id'}, inplace=True)
    
            reshaped_data.append(temp_df)
    
    final_df = pd.concat(reshaped_data)
    
    final_df = final_df.drop_duplicates().reset_index(drop=True)
    
    print(final_df)
    
    Login or Signup to reply.
  2. maybe melt() does what you want?

    df = pd.DataFrame({
        'id': [1, 2],
        'name': ['joe','sue'],
        'employee_0_salary': [30000, 35000],
        'employee_0_skills_0_id': [101, 102],
        'employee_0_skills_1_id': [103, 104],
        'employee_1_salary': [32000, 36000],
        'employee_1_skills_0_id': [105, 106],
        # Add other columns...
    })
    
    print(df)
       id name  employee_0_salary  employee_0_skills_0_id  employee_0_skills_1_id  
    0   1  joe              30000                     101                     103   
    1   2  sue              35000                     102                     104   
    
    print(df.melt(id_vars=['id','name']))
       id name                variable  value
    0   1  joe       employee_0_salary  30000
    1   2  sue       employee_0_salary  35000
    2   1  joe  employee_0_skills_0_id    101
    3   2  sue  employee_0_skills_0_id    102
    4   1  joe  employee_0_skills_1_id    103
    5   2  sue  employee_0_skills_1_id    104
    6   1  joe       employee_1_salary  32000
    7   2  sue       employee_1_salary  36000
    8   1  joe  employee_1_skills_0_id    105
    9   2  sue  employee_1_skills_0_id    106
    

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search