In PySpark, I have two DataFrames. The first, df1, is an empty DataFrame created from a schema. The second, df2, is a non-empty DataFrame loaded from a CSV file.
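For concreteness, a minimal setup sketch; the schema fields and the CSV path below are hypothetical placeholders, not part of my actual data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Reference schema (hypothetical columns) and an empty DataFrame built from it
schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", IntegerType(), True),
    StructField("D", StringType(), True),
])
df1 = spark.createDataFrame([], schema)

# Non-empty DataFrame read from a CSV file (path is a placeholder)
df2 = spark.read.csv("input.csv", header=True, inferSchema=True)
```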
Now I want to merge them such that all of the scenarios below are covered.
- If both DataFrames contain the same set of columns, merge them.
- If the second DataFrame contains additional columns, drop those columns.
- If the second DataFrame contains fewer columns, populate the missing columns with null values.
I tried iterating over the fields of the empty DataFrame's schema:
```python
schema = ...  # the schema (a few columns) that the empty DataFrame was built from
for field in schema.fields:
    if field.name in df2.columns:
        # copy the matching column from df2, cast to the schema's data type
        final_df = df1.withColumn(field.name, df2[field.name].cast(field.dataType))
```
2 Answers
Basically, you have an input DataFrame with columns A, B, and C and a reference schema with columns A, B, and D, and you want your output DataFrame to match this "reference" schema by dropping column C and adding a column D filled with nulls.
Assuming `df_ref` is your reference DataFrame and `df` is the DataFrame loaded from your CSV file, you can achieve the desired merge using the `join` operation in PySpark. Here's an example code snippet that covers all the scenarios you mentioned:
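The snippet below is a sketch of that join-based approach rather than a verbatim solution: it assumes the columns shared between `df` and `df_ref` can be used as join keys, and the variable names are illustrative.

```python
from pyspark.sql.functions import col

# Columns present in both the CSV DataFrame and the reference schema
common_cols = [c for c in df_ref.columns if c in df.columns]

# Left-joining against the empty reference frame keeps every row of df;
# columns that exist only in df_ref come out as nulls because nothing matches.
joined = df.join(df_ref, on=common_cols, how="left")

# Selecting the reference fields drops df-only columns (e.g. C) and
# casts the remaining ones to the reference data types.
final_df = joined.select(
    [col(field.name).cast(field.dataType) for field in df_ref.schema.fields]
)

final_df.printSchema()
```

A common alternative that sidesteps the join entirely is to build the select list straight from `df_ref.schema`, substituting `lit(None).cast(field.dataType)` for any field that is missing from `df`.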