
I’m trying to identify the most recently added files and transfer their contents to an Azure SQL Database table. The problem is that the new files are added to subfolders, and I don’t know how to search within them. Is it possible to search within the subfolders of the container itself?

This is the expression I’m using to list the files, along with my pipeline variables and their initial values:

@activity('Get ListOfFileNames').output.childItems

varReferenceDateTime - 1900-01-01 00:00:00
varLatestFileName
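
For context (a hedged illustration, names made up): childItems returns only the immediate children of the dataset path, and a subfolder appears as a single Folder entry instead of being recursed into, e.g.:

[
  { "name": "Subfolder1", "type": "Folder" },
  { "name": "File1.csv", "type": "File" }
]

This is why files inside subfolders never show up in the loop.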

If this process cannot be achieved with ADF, is there another configuration that I could use?

2 Answers


  1. Chosen as BEST ANSWER

    This is the problem that occurs at the source activity of the data flow:

    [screenshot: data flow problem]


  2. As your folder structure is nested, the Get Metadata activity won’t recognize a wildcard path for the file names, so it requires nested pipelines based on the folder depth: if there are two folder levels, a parent pipeline has to call a child pipeline in a loop.

    To avoid this nested pipeline, you can use a dataflow: first get the list of all file paths in the folder structure using the dataflow, then pass that list to a ForEach with the below logic (a sketch of the per-iteration expressions follows the outline).

    -> Dataflow activity
    -> ForEach
            -> Get Metadata activity - get the last modified date of the file
            -> If activity - compare the dates
                -> update the file path variable with the current file path
                -> update the latest date variable
    -> Copy activity - use the latest file path variable in the source file path
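
    As a rough sketch of those per-iteration expressions (hedged: 'Get FileMetadata', varLatestDate, and varLatestFilePath are hypothetical names; initialize varLatestDate to an old timestamp such as 1900-01-01T00:00:00Z, and run the ForEach sequentially so the variable updates don’t race):

    If activity condition - true when the current file is newer:
    @greater(ticks(activity('Get FileMetadata').output.lastModified), ticks(variables('varLatestDate')))

    Set variable varLatestFilePath (True branch):
    @item().filepath

    Set variable varLatestDate (True branch):
    @activity('Get FileMetadata').output.lastModified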
    

    Give your source dataset a path that goes only up to the container, leaving the folder and file fields empty.

    Give this dataset to the dataflow, and in the dataflow source settings, use the wildcard path folder/*/* and add a column named filepath to store the file path.
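
    In data flow script terms, the source would look roughly like this (a sketch; the container name and stream name are placeholders):

    source(allowSchemaDrift: true,
        validateSchema: false,
        rowUrlColumn: 'filepath',
        wildcardPaths:['container/*/*']) ~> source1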

    If you want, you can also add Start time (subDays(currentUTC(),1)) and End time (currentUTC()) filters for the files, but make sure to change the expressions as per your time zone.
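
    For example, using the data flow fromUTC() function with a placeholder time zone:

    Start time: subDays(fromUTC(currentUTC(), 'Asia/Kolkata'), 1)
    End time: fromUTC(currentUTC(), 'Asia/Kolkata')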


    Whatever the file structure might be, this will merge all the rows of the files and add a column containing the file path that each row came from.

    Now, use an aggregate transformation on this. In the Group by, use the filepath column, and for the aggregate add a sample count column with the expression count(<any column name>).
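
    The same step in data flow script (assuming the files contain a column named somecolumn):

    source1 aggregate(groupBy(filepath),
        rowcount = count(somecolumn)) ~> Aggregate1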

    This will give one row per file path.

    Then, use a Select transformation with rule-based mapping to keep only the filepath column.

    In the dataflow sink, use a cache sink and select Write to activity output.
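
    A sketch of the select and cache sink in data flow script (stream names are again placeholders):

    Aggregate1 select(mapColumn(
            each(match(name == 'filepath'))
        )) ~> Select1
    Select1 sink(store: 'cache',
        format: 'inline',
        output: true,
        saveOrder: 1) ~> sink1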


    In the pipeline, add the dataflow activity and select this dataflow in its settings.

    If you execute the dataflow activity, it will return the file paths as an array in its output.

    Give the below expression to the ForEach.

    @activity('Data flow1').output.runStatus.output.sink1.value
    

    Now apply the earlier per-iteration logic, using @item().filepath in every iteration, to get the latest file.
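
    Each element of that array is an object keyed by the selected column, so @item().filepath resolves to one path per iteration; the items look something like this (illustrative paths):

    [
        { "filepath": "container/Subfolder1/File1.csv" },
        { "filepath": "container/Subfolder2/File2.csv" }
    ]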
