
Description:
I have a list of column names that I need.
I want to check whether all of these column names are present in a dataframe. If only some of the columns are present, then I want to use just those columns, with generic code like:

df1 = df.select(df['column1'], df['column2'])

col_list = ['column1', 'column2', 'column3', 'column4']

I want to check which columns from the list are present in the dataframe, and use whatever columns are present in the select query.

2 Answers


  1. You need to do it in an iterative fashion:

    select_list = ['col1','col2','col3']
    df_columns = sparkDF.columns ### ['col1','col2','col5','col7']
    
    final_select_list = []
    
    for col in select_list:
        if col in df_columns:
            final_select_list += [col]
    
    ### final_select_list --> ['col1','col2']
    
    
    sparkDF.select(*final_select_list).show()
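
    Since the question asks for generic code, the same loop can also be wrapped into a small reusable helper. This is only a sketch; the function name select_existing is an illustration, not a Spark API:

    from pyspark.sql import DataFrame

    def select_existing(sdf: DataFrame, wanted_cols):
        ### keep only the wanted columns that actually exist in the dataframe,
        ### preserving the order of wanted_cols
        existing = [c for c in wanted_cols if c in sdf.columns]
        return sdf.select(*existing)

    ### e.g. select_existing(sparkDF, select_list).show()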
    
  2. The other answer works perfectly, but it can also be written as a one-liner.

    # predefined list of all required columns
    reqd_cols = ['id', 'dt', 'name', 'phone']
    
    data_sdf. \
        select(*[k for k in data_sdf.columns if k in reqd_cols])
    

    The list comprehension inside select() checks, for each column of the data_sdf dataframe, whether it is present in the reqd_cols list, and keeps only the overlapping ones.
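
    For reference, here is a minimal end-to-end sketch of the same one-liner; the sample dataframe and its column names ('id', 'name', 'city') are made up purely for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # hypothetical sample data with columns 'id', 'name', 'city'
    data_sdf = spark.createDataFrame(
        [(1, 'a', 'x'), (2, 'b', 'y')],
        ['id', 'name', 'city']
    )

    reqd_cols = ['id', 'dt', 'name', 'phone']

    # only 'id' and 'name' exist in data_sdf, so only those two are selected
    data_sdf. \
        select(*[k for k in data_sdf.columns if k in reqd_cols]). \
        show()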
