I stumbled upon a good ADF template called Copy new files only by LastModifiedDate. This template lets you sync two Blob Storages, but only one way (A --> B). The template works fairly well.
The problem is that when a file is deleted in A, it is not removed in B.
How can I make ADF check whether a file in B (Sink) is no longer present in A (Source) and, if so, delete it?
I found a post about this scenario, but it's for SQL, not files. The goal I want to reach is the same, though.
2 Answers
You can use a Get Metadata activity on both the source and the sink to get the list of files on each side.
Then use a Filter activity to get the delta files that are present in the sink but not in the source:
Items: @activity('Get List of Sink Files').output.childItems
Condition: @not(contains(activity('Get List Of Source Files').output.childItems, item()))
Then loop over the filtered array (for example with a ForEach activity) and use a Delete activity inside it to remove those files.
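A minimal sketch of how these activities could be chained in pipeline JSON; the activity names match the expressions above, while SinkFolderDataset, SourceFolderDataset, SinkFileDataset and its fileName parameter are placeholders for illustration:

{
  "activities": [
    {
      "name": "Get List of Sink Files",
      "type": "GetMetadata",
      "typeProperties": {
        "dataset": { "referenceName": "SinkFolderDataset", "type": "DatasetReference" },
        "fieldList": [ "childItems" ]
      }
    },
    {
      "name": "Get List Of Source Files",
      "type": "GetMetadata",
      "typeProperties": {
        "dataset": { "referenceName": "SourceFolderDataset", "type": "DatasetReference" },
        "fieldList": [ "childItems" ]
      }
    },
    {
      "name": "Filter Deleted Files",
      "type": "Filter",
      "dependsOn": [
        { "activity": "Get List of Sink Files", "dependencyConditions": [ "Succeeded" ] },
        { "activity": "Get List Of Source Files", "dependencyConditions": [ "Succeeded" ] }
      ],
      "typeProperties": {
        "items": { "value": "@activity('Get List of Sink Files').output.childItems", "type": "Expression" },
        "condition": { "value": "@not(contains(activity('Get List Of Source Files').output.childItems, item()))", "type": "Expression" }
      }
    },
    {
      "name": "Delete Removed Files",
      "type": "ForEach",
      "dependsOn": [ { "activity": "Filter Deleted Files", "dependencyConditions": [ "Succeeded" ] } ],
      "typeProperties": {
        "items": { "value": "@activity('Filter Deleted Files').output.Value", "type": "Expression" },
        "activities": [
          {
            "name": "Delete File",
            "type": "Delete",
            "typeProperties": {
              "dataset": {
                "referenceName": "SinkFileDataset",
                "type": "DatasetReference",
                "parameters": { "fileName": "@item().name" }
              },
              "enableLogging": false
            }
          }
        ]
      }
    }
  ]
}

The Filter condition compares whole childItems entries (objects with name and type), which works here because both Get Metadata activities return the same shape.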
My blog post below shows a sample for a similar scenario; in your case the changes would be to reverse source and sink, and to use a Delete activity rather than a Copy activity:
https://datasharkx.wordpress.com/2023/04/18/moving-delta-missing-files-between-source-and-sink-via-azure-data-factory-synapse-pipeline/
You can get all the file paths in all the subfolders using an ADF data flow. Since you mention that your source and target are Blob Storage, you can try a data flow for this.
Create a CSV (DelimitedText) dataset configured as follows: give the path up to the root container of the source folder, disable First row as header, and set the column delimiter to a character that does not occur in your files.
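A sketch of such a dataset in JSON, assuming a Blob Storage linked service named AzureBlobStorageLS, a source root container named sourcecontainer, and | standing in for whatever delimiter never occurs in your data:

{
  "name": "SourceFilesDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "AzureBlobStorageLS", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": { "type": "AzureBlobStorageLocation", "container": "sourcecontainer" },
      "columnDelimiter": "|",
      "firstRowAsHeader": false
    }
  }
}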
Give this dataset to the data flow source. In the source settings, set the wildcard file path to */** and give a column name for the file path (the Column to store file name option). The source will union the rows of all the files and add another column, filepath, which holds each row's file path. As we have given an unused column delimiter and disabled First row as header in the dataset, the data will have only one file column (col_0) plus the filepath column added by the data flow.
To get all the unique file paths, use an aggregate transformation with a group-by on the filepath column, and in the aggregates section give a sample column with the expression count(filepath). This returns one row per distinct file path.
Then use a select transformation to remove the extra count column. After that, add a sink and select Cache as the sink type.
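Put together, the data flow script for this source -> aggregate -> select -> cache-sink chain would look roughly like the sketch below (transformation names such as source1 and sinkCache are illustrative, and the script ADF generates for you may differ in detail):

source(allowSchemaDrift: true,
    validateSchema: false,
    wildcardPaths:['*/**'],
    rowUrlColumn: 'filepath') ~> source1
source1 aggregate(groupBy(filepath),
    count = count(filepath)) ~> aggregate1
aggregate1 select(mapColumn(filepath),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> select1
select1 sink(validateSchema: false,
    allowSchemaDrift: true,
    store: 'cache',
    format: 'inline',
    output: false,
    saveOrder: 1) ~> sinkCache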
Now, in the pipeline, take a Data flow activity and add this data flow. In the activity settings, uncheck First row only.
Upon executing the pipeline, the dataflow output JSON will contain the required file paths as an array.
You can get this array in the pipeline using an expression on the data flow activity's output.
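The exact expression depends on your activity and sink names; assuming the Data flow activity is named Get Source File Paths and its cache sink is named sinkCache, it would be:

@activity('Get Source File Paths').output.runStatus.output.sinkCache.value

With First row only unchecked, .value holds the full array of rows.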
Do the same with the target root container using another data flow activity. Once you have both arrays, compare them to get the deleted file paths. Later you can delete those files with a Delete activity, as suggested by @Nandan.
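A Filter activity can do the comparison, following the same pattern as the first answer; a sketch assuming the two Data flow activities are named Get Sink File Paths and Get Source File Paths, each with a cache sink named sinkCache:

Items: @activity('Get Sink File Paths').output.runStatus.output.sinkCache.value
Condition: @not(contains(activity('Get Source File Paths').output.runStatus.output.sinkCache.value, item()))

Each item is a row object such as { "filepath": "subfolder/file1.csv" }, so the filter output is the list of files that exist only in the target; you can then feed @item().filepath into a parameterized Delete dataset inside a ForEach.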