
Hello, I have quite a few files that I want to combine using Azure Data Factory.

I’m currently using the Union transformation in a dataflow to combine a few files at a time, but I have about a thousand files and was wondering if there is an easier way to do this. The files are all in the same folder and have the same column headers.

I’ve also tried the Copy activity, but it scrambles my files’ rows in the sink dataset, so I’m avoiding it for now since I need my rows to stay ordered.

2 Answers


  1. The most feasible approach to merging multiple files is the ‘Merge files’ copy behavior in the Copy data activity.

    However, as you mentioned, as of today the ADF mergeFiles copy behavior gives you no control over the order in which files are merged, or the order of rows within them; it copies the data in a random order.

    To merge the data in an ordered way, consider the workaround of first loading the data into a SQL table by looping through the files one by one, in sequential order, with a ForEach activity. Try the steps below:

    • Create a control (metadata/config) table or file that lists all the filenames along with the serial number in which you want to load them:
    1 file1.csv
    2 file2.csv
    3 file3.csv .. and so on

    • Use a Lookup activity to fetch the filenames from this control table, ordered by the serial number.
    • Use a ForEach activity to loop through the filenames, with the Sequential option turned on.
    • Inside the loop, use a Copy activity to load each file’s data into a sink SQL table.
    • Once the data is loaded into the table, use another Copy activity outside the ForEach to load the data into the target file.
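The steps above can be sketched locally. This is a minimal Python stand-in for the pipeline logic, not ADF itself: the `files` dict and `control` list are hypothetical stand-ins for the CSVs in the storage container and the control table, and an in-memory SQLite table plays the role of the staging SQL table.

```python
import csv
import io
import sqlite3

# Hypothetical in-memory stand-ins for the CSV files in the container.
files = {
    "file1.csv": "Id,Name,Age\n1,Rakesh,22\n2,Laddu,22\n",
    "file2.csv": "Id,Name,Age\n3,Virat,34\n",
}

# Control table: serial number -> filename, i.e. what the Lookup
# activity would fetch, ordered by the serial number.
control = [(1, "file1.csv"), (2, "file2.csv")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (Id INTEGER, Name TEXT, Age INTEGER)")

# Sequential ForEach: append each file strictly in control-table order,
# so the row order across files is preserved.
for _, name in sorted(control):
    reader = csv.DictReader(io.StringIO(files[name]))
    conn.executemany(
        "INSERT INTO staging (Id, Name, Age) VALUES (?, ?, ?)",
        [(r["Id"], r["Name"], r["Age"]) for r in reader],
    )

# Final Copy activity: read the staged rows back out in insertion order.
merged = conn.execute("SELECT Id, Name, Age FROM staging ORDER BY rowid").fetchall()
print(merged)
```

Because the loop is strictly sequential, the staging table ends up with file1’s rows before file2’s, which is exactly what the Sequential option on the ForEach guarantees in the pipeline.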
  2. In ADF, whether you merge files with a Copy activity or a dataflow, the order is random. But if your files are all in the same folder, you can try the workaround below using a dataflow.

    These are my sample files, and I put some sample data in every file like this:

    Id,Name,Age
    1,Rakesh,22
    2,Laddu,22
    3,Virat,34
    

    Since you want the data order to match the file order, use the file name as the sorting column: in the dataflow source settings, set a ‘Column to store file name’ so each row carries the name of the file it came from.


    In the dataset, give the path only up to the container, and in the dataflow’s wildcard paths give the remaining path, e.g. folder/*.csv.

    In the source data preview you can see that the row order is random.

    However, the merge does not change the row order within each individual file. So if we sort the combined data by the filename column, we get the data back in the correct order as per the filenames.

    Use a Sort transformation on the filename column; the data preview will then show the rows in the correct order.

    Then use a Select transformation to remove the extra filename column, and connect the result to the sink.
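The source → sort → select chain can be sketched in plain Python. This is an illustration of why the trick works, with hypothetical in-memory CSVs standing in for folder/*.csv; the key point is that a stable sort on the filename column groups files in name order while keeping each file’s internal row order intact.

```python
import csv
import io

# Hypothetical stand-ins for folder/*.csv; the dict is deliberately
# listed out of order to mimic the random merge order in the preview.
files = {
    "file2.csv": "Id,Name,Age\n3,Virat,34\n",
    "file1.csv": "Id,Name,Age\n1,Rakesh,22\n2,Laddu,22\n",
}

# Source: read every file and tag each row with its filename,
# like the extra column the dataflow source stores.
rows = []
for name, text in files.items():
    for r in csv.DictReader(io.StringIO(text)):
        r["filename"] = name
        rows.append(r)

# Sort transformation: list.sort is stable, so rows from the same
# file keep their relative order while files are put in name order.
rows.sort(key=lambda r: r["filename"])

# Select transformation: drop the helper column before the sink.
merged = [{k: v for k, v in r.items() if k != "filename"} for r in rows]
print(merged)
```

Note this only restores the correct overall order because, as the answer says, the merge never shuffles rows *within* a file; the filename sort fixes the order *between* files.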

    The sink file then contains the merged data in the expected order.