Editing this to reflect addition work:
Situation
I have 2 pandas dataframes of Twitter search tweets API data in which I have a common data key, author_id
.
I’m using the join
method.
Code is:
dfTW08 = dfTW07.join(dfTW04uf, on='author_id', how='left', lsuffix='', rsuffix='4')
Results
When I run that, everything comes out as expected, except that all the other
dataframe (dfTW04uf
) values come in as NaN
. Including the values for the other
dataframe’s author_id
column.
Assessment
I’m not getting any error messages, but have to think it’s something about the datatypes. The other
dataframe is a mix of int64, object, bool, and datetime datatypes. So it seems odd they’d all be unrecognized.
Any suggestions on how to troubleshoot this greatly appreciated.
2
Answers
Couldn't figure out the
NaN
issue usingjoin
, but was able tomerge
the databases with this:callingdf.merge(otherdf, on='author_id', how='left', indicator=True)
Then did
sort_values
anddrop_duplicates
to get the final list I wanted.You can use
merge
instead ofjoin
sincemerge
had everythingjoin
does but with more "power". (anything you can do withjoin
you can do withmerge
)I am assuming the
NaN
is coming up since the results aren’t being discarded when you asked the firstjoin
to use on author ID and then include suffixes fo x an y. When youleft join
withmerge
you are discarding the non matches without any x and y suffixes.