
I am trying to create an ADF pipeline that does the following:

Takes in a CSV with 2 columns, e.g.:
Source, Destination
test_container/test.txt, test_container/test_subfolder/test.txt

Essentially I want to copy/move the file at the Source path to the Destination path (both locations are in Azure Blob Storage).

I think there is a way to do this using the Lookup activity, but it is limited to 5000 rows and my CSV will be larger than that. Any suggestions on how this can be accomplished?

Thanks in advance,

2 Answers


  1. First, if you want to maintain a hierarchical structure in your data, I recommend using ADLS (Azure Data Lake Storage), as it guarantees a directory structure for your data.

    Second, if you have a folder in Blob Storage and you would like to copy files into it, use the Copy activity. You should define two datasets, one for the source and one for the sink.

    Check this link: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview


  2. This is a complex scenario for Azure Data Factory. As you mentioned, your CSV has more than 5000 file path records, which means the same number of Source and Destination paths. If you build this architecture in ADF, it will go like this:

    1. You will use the Lookup activity to read the Source and Destination paths. Here again you cannot read all the paths in one pass because of the Lookup activity's 5000-row limit.

    2. Then you will iterate over the records using a ForEach activity.

    3. You also need to split each path so that you get the container, directory and file name separately to pass to the datasets created for the Source and Destination locations. Once you split the paths, use the Set variable activity to store the Source and Destination container, directory and file names; these variables are then passed to the datasets dynamically. This is the tricky part: if even a single record fails to split properly, the whole pipeline fails (see the Python sketch after this list for the splitting logic).

    4. If the step above completes successfully, you need not worry about the Copy activity. As long as all the parameters receive the expected values under its Source and Sink tabs, it will work properly.
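
    To make step 3 concrete, here is the splitting logic as a plain Python sketch, assuming paths shaped like the sample row (container/optional/sub/dirs/file); inside the pipeline you would express the same decomposition with dynamic content expressions (e.g. split()).

    ```python
    # Hypothetical helper: decompose "container/sub/dirs/file.txt" into the three
    # pieces the Source and Destination datasets would be parameterised with.
    def split_blob_path(path: str):
        parts = path.strip().split("/")
        container = parts[0]               # first segment: the container
        file_name = parts[-1]              # last segment: the file name
        directory = "/".join(parts[1:-1])  # anything in between (may be empty)
        return container, directory, file_name

    print(split_blob_path("test_container/test_subfolder/test.txt"))
    # ('test_container', 'test_subfolder', 'test.txt')
    ```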

    My suggestion is to use a programmatic approach instead. For example, use Python to read the CSV file with the pandas module, iterate over each row, and copy the files. This works fine even with 5000+ records.
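
    A minimal sketch of that approach, assuming the two-column CSV from the question is saved as paths.csv, the azure-storage-blob SDK is installed, and both paths live in the same storage account (the connection string below is a placeholder):

    ```python
    # Read the two-column CSV and start a server-side copy for each blob.
    import pandas as pd
    from azure.storage.blob import BlobServiceClient

    CONN_STR = "<storage-account-connection-string>"  # placeholder
    service = BlobServiceClient.from_connection_string(CONN_STR)

    def split_path(path):
        """'test_container/test_subfolder/test.txt' -> ('test_container', 'test_subfolder/test.txt')"""
        container, _, blob_name = path.strip().partition("/")
        return container, blob_name

    # skipinitialspace handles the space after the comma in "Source, Destination"
    df = pd.read_csv("paths.csv", skipinitialspace=True)

    for row in df.itertuples(index=False):
        src_container, src_blob = split_path(row.Source)
        dst_container, dst_blob = split_path(row.Destination)

        src_client = service.get_blob_client(container=src_container, blob=src_blob)
        dst_client = service.get_blob_client(container=dst_container, blob=dst_blob)

        # Within one account the connection-string credentials cover the source read;
        # across accounts the source URL would need a SAS token appended.
        dst_client.start_copy_from_url(src_client.url)
        # To "move" instead of copy, delete the source once the copy has completed:
        # src_client.delete_blob()
    ```

    The sequential loop avoids the Lookup activity's 5000-row limit entirely; start_copy_from_url only initiates a server-side copy, so for very large CSVs you could also parallelise the loop with a thread pool.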

    You can refer to this SO thread, which will help you implement the same programmatically.
