skip to Main Content

I have a pandas dataframe that has most of column Job_Code as unique text, however there are missing records recorded as nulls (ie "NaN"). To push this into a Database with Job_Code as a Primary Key I want to replace these missing records with new unique values. My current df is 265 rows × 41 columns.
For simplicity:

df = pd.DataFrame({'Client':['Microsoft', 'Google', 'Apple', 'StackOverflow', 'PostGres'],
                   'Job_Code': ['J001', 'J002', np.NaN, 'J003', np.NaN]})

My desired outcome is:

df = pd.DataFrame({'Client':['Microsoft', 'Google', 'Apple', 'StackOverflow', 'PostGres'],
                   'Job_Code': ['J001', 'J002', 'tempkey1', 'J003', 'tempkey2']})

My initial code to solve executed, but didn’t work:

c=1
for e in enumerate(df.Job_Code):
     if pd.isnull(e):
        e == "tempkey" + str(c)
         c+1

I was looking and couldn’t find a solution to my problem, though more than once there was a case of "Don’t use iterrows!".

Feel free to comment on any methods I’ve used.

3

Answers


  1. Chosen as BEST ANSWER

    In the end I used this and it worked, but I'd accept a better answer, one that isn't iterrows reliant.

    c=1
    for i, row in df.iterrows():
        if pd.isnull(row['Job_Code']):
            df.at[i,'Job_Code'] = "tempkey" + str(c)
            c= c+1
    

  2. You can create mask by test missing values by Series.isna and then in DataFrame.loc add range by RangeIndex with count number of Trues by sum:

    m = df['Job_Code'].isna()
    df.loc[m, 'Job_Code'] = 'tempkey' + pd.RangeIndex(1,m.sum()+1).astype(str)
    print (df)
              Client  Job_Code
    0      Microsoft      J001
    1         Google      J002
    2          Apple  tempkey1
    3  StackOverflow      J003
    4       PostGres  tempkey2
    

    Performance:

    df = pd.DataFrame({'Client':['Microsoft', 'Google', 'Apple', 'StackOverflow', 'PostGres'],
                       'Job_Code': ['J001', 'J002', np.NaN, 'J003', np.NaN]})
    
    #50000 rows
    df = pd.concat([df] * 10000, ignore_index=True)
    print (df)
    
    
    In [23]: %%timeit
        ...: c=1
        ...: for i, row in df.iterrows():
        ...:     if pd.isnull(row['Job_Code']):
        ...:         df.at[i,'Job_Code'] = "tempkey" + str(c)
        ...:         c= c+1
        ...:         
    4.35 s ± 384 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [24]: %%timeit
        ...: mask = df['Job_Code'].isna()
        ...: tempkey = (df['Job_Code'].groupby(mask).cumcount() + 1).map('tempkey{}'.format)
        ...: df['Job_Code'] = df['Job_Code'].fillna(tempkey)
        ...: 
    28.1 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [25]: %%timeit
        ...: m = df['Job_Code'].isna()
        ...: df.loc[m, 'Job_Code'] = 'tempkey' + pd.RangeIndex(1,m.sum()+1).astype(str)
        ...: 
    2.86 ms ± 93.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    Login or Signup to reply.
  3. try this:

    mask = df['Job_Code'].isna()
    tempkey = (df['Job_Code'].groupby(mask).cumcount() + 1).map('tempkey{}'.format)
    df['Job_Code'] = df['Job_Code'].fillna(tempkey)
    print(df)
    >>>
    
        Client          Job_Code
    0   Microsoft       J001
    1   Google          J002
    2   Apple           tempkey1
    3   StackOverflow   J003
    4   PostGres        tempkey2
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search