Hello, I have quite a few files that I want to combine using Azure Data Factory.
I'm currently using the Union activity in a data flow to combine a few files at a time, but I have about a thousand files and was wondering if there is an easier way to do this? These files are all placed in the same folder and have the same column headers.
I've also tried using the Copy activity, but that messes up the order of my files' rows in the sink dataset, so I'm avoiding it for now since I need my rows to stay ordered.
2 Answers
The most feasible approach for merging multiple files is to use the 'Merge files' copy behavior in the Copy data activity.
However, as you mentioned, the ADF mergeFiles copy behavior currently gives no control over the order in which files, or the rows within them, are merged; the data is copied in a random order.
To merge the data in an ordered way, you can consider the workaround of first loading the data into a SQL table by looping through the files one by one, in sequential order, with a ForEach activity.
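In ADF that pipeline is typically a Get Metadata activity (childItems) feeding a sequential ForEach that runs a Copy activity per file. Purely as an illustration of the same load-in-file-name-order idea outside ADF, here is a minimal Python sketch; the folder path, connection string, and table name are placeholders, not anything from the original post:

```python
# Illustrative sketch only: mirrors "loop through files in order, append each to SQL".
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

folder = Path("data/source_folder")            # placeholder: folder holding the CSV files
engine = create_engine("mssql+pyodbc://...")   # placeholder: your Azure SQL connection string

# Process files one at a time in file-name order, like a sequential ForEach.
for csv_path in sorted(folder.glob("*.csv")):
    df = pd.read_csv(csv_path)
    # Appending file by file preserves each file's internal row order on insert.
    df.to_sql("merged_table", engine, if_exists="append", index=False)
```

Note that SQL tables have no guaranteed physical order, so in practice you would also load an identity column (or the file name plus a row number) and ORDER BY it when reading the merged data back out.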
In ADF, with both the Copy activity and the data flow merge-files behavior, the order is random. But since your files are all in the same folder, you can try the workaround below using a data flow.
For this demo I placed a few sample files in one folder, each with the same column headers and a few rows of sample data.
Since you want the data to come out in the same order as the files, use the file name as the sorting column.
In the dataset, give the path only up to the container, and in the data flow source's wildcard paths give the remaining path, for example folder/*.csv. In the source options, also set 'Column to store file name' to a column such as 'filename', so every row carries the name of the file it came from. If you check the source data preview, the files are read in a random order.
However, the row order within each individual file is not changed. So, if we sort the combined data by the 'filename' column, we get the rows back in the correct order as per the file names. Use a Sort transformation like below.
You can see we got the correct order.
Then use a Select transformation to remove the extra 'filename' column from the data and send the result to the sink. The sink file then contains the merged data in file-name order.
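If you want to sanity-check the expected output outside ADF, the same source, sort, select, and sink steps can be mimicked with a short Python sketch. This is only an illustration, not part of the data flow itself; the folder path, the 'filename' column name, and the output file name are assumptions:

```python
# Illustrative sketch: wildcard source with a filename column -> sort by filename
# -> drop the helper column -> write a single merged file.
from pathlib import Path

import pandas as pd

folder = Path("container/folder")        # placeholder: folder matched by the wildcard path
frames = []
for csv_path in folder.glob("*.csv"):    # glob order is arbitrary, like the source preview
    df = pd.read_csv(csv_path)
    df["filename"] = csv_path.name       # plays the role of 'Column to store file name'
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)

# mergesort is a stable sort, so rows within each file keep their original order,
# which is the same reason sorting by filename works in the data flow.
combined = combined.sort_values("filename", kind="mergesort")

# Drop the helper column (the Select transformation step) and write the sink file.
combined.drop(columns=["filename"]).to_csv("merged.csv", index=False)
```

If the local result looks right, the data flow with the same sort key should give you the same ordering in the sink file.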