How do you iterate over a pandas table and replace nulls with a unique or incremental number? - Postgresql

Nick_Jo
January 17, 2023
262 views
2 votes
3 Answers

I have a pandas dataframe that has most of column Job_Code as unique text, however there are missing records recorded as nulls (ie "NaN"). To push this into a Database with Job_Code as a Primary Key I want to replace these missing records with new unique values. My current df is 265 rows × 41 columns.
For simplicity:

df = pd.DataFrame({'Client':['Microsoft', 'Google', 'Apple', 'StackOverflow', 'PostGres'],
                   'Job_Code': ['J001', 'J002', np.NaN, 'J003', np.NaN]})

My desired outcome is:

df = pd.DataFrame({'Client':['Microsoft', 'Google', 'Apple', 'StackOverflow', 'PostGres'],
                   'Job_Code': ['J001', 'J002', 'tempkey1', 'J003', 'tempkey2']})

My initial code to solve executed, but didn’t work:

c=1
for e in enumerate(df.Job_Code):
     if pd.isnull(e):
        e == "tempkey" + str(c)
         c+1

I was looking and couldn’t find a solution to my problem, though more than once there was a case of "Don’t use iterrows!".

Feel free to comment on any methods I’ve used.

Answers

Chosen as BEST ANSWER
- Nick_Jo
- January 17, 2023 at 6:56 am
- 0 votes
0
In the end I used this and it worked, but I'd accept a better answer, one that isn't iterrows reliant.
```
c=1
for i, row in df.iterrows():
    if pd.isnull(row['Job_Code']):
        df.at[i,'Job_Code'] = "tempkey" + str(c)
        c= c+1
```

(Edit)

You can create mask by test missing values by Series.isna and then in DataFrame.loc add range by RangeIndex with count number of Trues by sum:

m = df['Job_Code'].isna()
df.loc[m, 'Job_Code'] = 'tempkey' + pd.RangeIndex(1,m.sum()+1).astype(str)
print (df)
          Client  Job_Code
0      Microsoft      J001
1         Google      J002
2          Apple  tempkey1
3  StackOverflow      J003
4       PostGres  tempkey2

Performance:

df = pd.DataFrame({'Client':['Microsoft', 'Google', 'Apple', 'StackOverflow', 'PostGres'],
                   'Job_Code': ['J001', 'J002', np.NaN, 'J003', np.NaN]})

#50000 rows
df = pd.concat([df] * 10000, ignore_index=True)
print (df)


In [23]: %%timeit
    ...: c=1
    ...: for i, row in df.iterrows():
    ...:     if pd.isnull(row['Job_Code']):
    ...:         df.at[i,'Job_Code'] = "tempkey" + str(c)
    ...:         c= c+1
    ...:         
4.35 s ± 384 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [24]: %%timeit
    ...: mask = df['Job_Code'].isna()
    ...: tempkey = (df['Job_Code'].groupby(mask).cumcount() + 1).map('tempkey{}'.format)
    ...: df['Job_Code'] = df['Job_Code'].fillna(tempkey)
    ...: 
28.1 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [25]: %%timeit
    ...: m = df['Job_Code'].isna()
    ...: df.loc[m, 'Job_Code'] = 'tempkey' + pd.RangeIndex(1,m.sum()+1).astype(str)
    ...: 
2.86 ms ± 93.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

try this:

mask = df['Job_Code'].isna()
tempkey = (df['Job_Code'].groupby(mask).cumcount() + 1).map('tempkey{}'.format)
df['Job_Code'] = df['Job_Code'].fillna(tempkey)
print(df)
>>>

    Client          Job_Code
0   Microsoft       J001
1   Google          J002
2   Apple           tempkey1
3   StackOverflow   J003
4   PostGres        tempkey2

Please signup or login to give your own answer.

Click here to cancel reply.

How do you iterate over a pandas table and replace nulls with a unique or incremental number? – Postgresql

Answers