I’m working with a dataset where I need to drop some columns which contain only NULL
values. The issue is that the column names are not consistent or similar, and can change with time. I was wondering if there is a way in ADF to drop a column if all instances are NULL
without having drifted columns?
I have tried unpivoting, removing rows, then re-pivoting, however after I pivot the data back to its original format, I get the following message:
"This drifted column is not in the source schema and therefore can only be referenced with pattern matching expressions"
The drifted columns don’t seem to join on subsequent join functions. I have also tried setting derived columns with regex column patters to make all the drifted columns explicit, however, the byName()
function doesn’t seem to work with the $$
syntax; namely:
toString(byName($$))
Any ideas of how to solve this within Azure Data Factory – Data Flows would be very much appreciated!
2
Answers
If the source column names will change, then you have to use column patterns. When you match columns based on patterns, you can project those into columns using the Select transformation. Use the rule-based mapping option in the Select transformation with true() as the matching expression and $$ as the Name As property like this:
I have used combination of both data factory pipeline activities and dataflow to achieve the requirement.
First, I have taken dataflow to output a file. I have added a new column with all values as
1
so that I can use aggregate on all other rows using this new column to group.I have used collect() to create an array for each of the column where group by is on above created column.
In pipeline2, I have used activities to get columns that are not entirely nulls, create a dynamic schema and then use this schema as mapping and write to a file only the required columns.
First, I have read the file written at the end of dataflow without header (even though the file has header). The dataset looks as shown below:
NOTE: In the copy data activity, the source is the original file.