
I stumbled upon a good ADF template called "Copy new files only by LastModifiedDate".


This ADF template allows you to sync two Blob Storages, but only one way: A --> B. The template works fairly well.

The problem is that when a file is deleted in A, it is not removed from B.

How can I make ADF check whether a file in B (sink) is no longer present in A (source) and, if so, delete it from B?

I found a post about this scenario but it’s for SQL, not files.

But the goal I want to reach is the same.

2 Answers


  1. You can use a Get Metadata activity on both the source and the sink to get the list of files in each.
    Then, via a Filter activity, get the delta files that are present in the sink but not in the source:

    Items : @activity('Get List of Sink Files').output.childItems
    Conditions : @not(contains(activity('Get List Of Source Files').output.childItems, item()))
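
    For reference, the childItems property returned by Get Metadata is an array of name/type objects, so the contains() check above compares whole objects between the two lists. The output looks roughly like this (file names are just examples):

    [
      { "name": "file1.csv", "type": "File" },
      { "name": "file2.csv", "type": "File" }
    ]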

    Then use a Delete activity, passing in the filtered array, to delete those files in the sink, as sketched below.
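
    A rough sketch of that last step (the activity, dataset, and parameter names here are illustrative): wrap the Delete activity in a ForEach over the filter output, then pass each file name to a parameterized sink dataset.

    ForEach Items : @activity('Filter Deleted Files').output.Value
    Delete dataset fileName : @item().name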

    My blog below shows a sample done in a similar way:
    https://datasharkx.wordpress.com/2023/04/18/moving-delta-missing-files-between-source-and-sink-via-azure-data-factory-synapse-pipeline/

    For your scenario, the changes would be to reverse source and sink and to use a Delete activity rather than a Copy activity.

  2. ADF doesn't have built-in logic to iterate through subfolders, but you can get all the file paths in all the subfolders using an ADF dataflow. Since your source and target are Blob Storage, you can try a dataflow for this.

    Create a CSV dataset with the configuration below. Give the path only up to the root container of the source folder, and set the column delimiter to a character that never appears in the data.
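
    A sketch of those dataset settings (the delimiter character is just an example):

    File path : <source root container> (leave directory and file name empty)
    Column delimiter : an unused character such as |
    First row as header : unchecked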


    Give this dataset to the dataflow source. In the source settings, give the wildcard file path */** and give a column name to store the file path.
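
    In the source options this would look roughly like the following (the column name filepath is just an example):

    Wildcard paths : */**
    Column to store file name : filepath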


    This will union the columns of all files and add another column, filepath, containing each row's file path. Because we gave an unused column delimiter and disabled First row as header in the dataset, the data will have only one file column, col_0, plus the filepath column added by the dataflow.


    To get all the unique file paths, use an Aggregate transformation: group by the filepath column, and in the Aggregates section add a sample column with the expression count(filepath), as below.
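
    A sketch of that Aggregate transformation, using count as the sample column name:

    Group by : filepath
    Aggregates : count = count(filepath)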


    This gives the distinct file paths along with their counts.


    Then use a Select transformation to remove the extra count column. After that, add a sink and select the cache sink type.
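
    Putting the pieces together, the dataflow stream is simply:

    source (wildcard */**, filepath column) -> aggregate (group by filepath) -> select (drop count) -> sink (cache)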


    Now, in the pipeline, take a Data flow activity and add this dataflow. In the activity settings, uncheck First row only.


    Upon executing the pipeline, the dataflow output JSON will contain the required file paths as an array.


    You can get this array in the pipeline using the expression below after the dataflow activity.

    @activity('Data flow1').output.runStatus.output.<sink_name>.value
    

    Do the same for the target root container with another dataflow activity. Once you have both arrays, compare them to get the file paths that exist in the target but not in the source (a sketch follows below). You can then delete those files with a Delete activity, as suggested by @Nandan.
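
    As a rough sketch of that comparison (the activity names 'Data flow source' and 'Data flow target', the sink name sink1, and the column name filepath are all illustrative), a Filter activity could use:

    Items : @activity('Data flow target').output.runStatus.output.sink1.value
    Conditions : @not(contains(activity('Data flow source').output.runStatus.output.sink1.value, item()))

    Each filtered item is then an object like { "filepath": "subfolder1/file1.csv" }, so inside a ForEach you can pass @item().filepath to the Delete activity's dataset parameters.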
