
Hello, I have quite a few files that I want to combine using Azure Data Factory.

I’m currently using the Union transformation in a dataflow to combine a few files at a time, but I have about a thousand files and was wondering if there is an easier way to do this. The files are all in the same folder and have the same column headers.

I’ve also tried the Copy activity, but it scrambles my files’ rows in the sink dataset, so I’m avoiding it for now since I need my rows to stay ordered.

2 Answers


  1. The most feasible approach to merging multiple files is the ‘Merge files’ copy behavior in the Copy data activity.

    However, as you mentioned, as of today the ADF mergeFiles copy behavior gives you no control over the order in which files are merged, or the order of rows within them; it copies the data in a random order.

    To merge the data in an ordered way, consider the workaround of first loading the data into a SQL table by looping through the files one by one, in sequential order, with a ForEach activity. Try the steps below:

    • Create a control (metadata/config) table or file that lists all the filenames along with the serial number in which you want to load them:
    1 file1.csv
    2 file2.csv
    3 file3.csv .. and so on

    • Use a Lookup activity to fetch the filenames from this control table, ordered by the serial number.
    • Use a ForEach activity to loop through the filenames, with the Sequential option turned on.
    • Inside the loop, use a Copy activity to load each file’s data into a sink SQL table.
    • Once the data is loaded into the table, use another Copy activity outside the ForEach to load the data into the target file.
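The steps above can be sketched locally. This is a minimal Python stand-in for the pipeline logic, not ADF itself: the `files` dict and `control` list are hypothetical stand-ins for the CSVs in the storage container and the control table, and an in-memory SQLite table plays the role of the staging SQL table.

```python
import csv
import io
import sqlite3

# Hypothetical in-memory stand-ins for the CSV files in the container.
files = {
    "file1.csv": "Id,Name,Age\n1,Rakesh,22\n2,Laddu,22\n",
    "file2.csv": "Id,Name,Age\n3,Virat,34\n",
}

# Control table: serial number -> filename, i.e. what the Lookup
# activity would fetch, ordered by the serial number.
control = [(1, "file1.csv"), (2, "file2.csv")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (Id INTEGER, Name TEXT, Age INTEGER)")

# Sequential ForEach: append each file strictly in control-table order,
# so the row order across files is preserved.
for _, name in sorted(control):
    reader = csv.DictReader(io.StringIO(files[name]))
    conn.executemany(
        "INSERT INTO staging (Id, Name, Age) VALUES (?, ?, ?)",
        [(r["Id"], r["Name"], r["Age"]) for r in reader],
    )

# Final Copy activity: read the staged rows back out in insertion order.
merged = conn.execute("SELECT Id, Name, Age FROM staging ORDER BY rowid").fetchall()
print(merged)
```

Because the loop is strictly sequential, the staging table ends up with file1’s rows before file2’s, which is exactly what the Sequential option on the ForEach guarantees in the pipeline.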
  2. In ADF, whether you merge files with a Copy activity or a dataflow, the order is random. But if your files are all in the same folder, you can try the workaround below using a dataflow.

    These are my sample files, and I put some sample data in every file like this:

    Id,Name,Age
    1,Rakesh,22
    2,Laddu,22
    3,Virat,34
    

    Since you want the data order to match the file order, use the file name as the sorting column: in the dataflow source settings, set a ‘Column to store file name’ so each row carries the name of the file it came from.


    In the dataset, give the path only up to the container, and in the dataflow’s wildcard paths give the remaining path, e.g. folder/*.csv.

    In the source data preview you can see that the row order is random.

    However, the merge does not change the row order within each individual file. So if we sort the combined data by the filename column, we get the data back in the correct order as per the filenames.

    Use a Sort transformation on the filename column; the data preview will then show the rows in the correct order.

    Then use a Select transformation to remove the extra filename column, and connect the result to the sink.
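The source → sort → select chain can be sketched in plain Python. This is an illustration of why the trick works, with hypothetical in-memory CSVs standing in for folder/*.csv; the key point is that a stable sort on the filename column groups files in name order while keeping each file’s internal row order intact.

```python
import csv
import io

# Hypothetical stand-ins for folder/*.csv; the dict is deliberately
# listed out of order to mimic the random merge order in the preview.
files = {
    "file2.csv": "Id,Name,Age\n3,Virat,34\n",
    "file1.csv": "Id,Name,Age\n1,Rakesh,22\n2,Laddu,22\n",
}

# Source: read every file and tag each row with its filename,
# like the extra column the dataflow source stores.
rows = []
for name, text in files.items():
    for r in csv.DictReader(io.StringIO(text)):
        r["filename"] = name
        rows.append(r)

# Sort transformation: list.sort is stable, so rows from the same
# file keep their relative order while files are put in name order.
rows.sort(key=lambda r: r["filename"])

# Select transformation: drop the helper column before the sink.
merged = [{k: v for k, v in r.items() if k != "filename"} for r in rows]
print(merged)
```

Note this only restores the correct overall order because, as the answer says, the merge never shuffles rows *within* a file; the filename sort fixes the order *between* files.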

    The sink file then contains the merged data in the expected order.