I stumbled upon a good ADF template called Copy new files only by LastModifiedDate. This template lets you sync two Blob Storages, but only one way (A --> B). The template works fairly well.
The problem is that when a file is deleted in A, it is not removed in B.
How can I make ADF check whether a file in B (Sink) is no longer present in A (Source) and, if so, delete it?
I found a post about this scenario, but it's for SQL, not files. The goal I want to reach is the same, though.
2 Answers
You can use a Get Metadata activity on both the source and the sink to get the list of files on each side.
Then use a Filter activity to get the delta files that are present in the sink but not in the source:
Items: @activity('Get List of Sink Files').output.childItems
Condition: @not(contains(activity('Get List Of Source Files').output.childItems, item()))
Then loop over the filtered array (for example with a ForEach activity) and use a Delete activity inside it to remove those files.
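A minimal sketch of how these activities could be chained in pipeline JSON; the activity names match the expressions above, while SinkFolderDataset, SourceFolderDataset, SinkFileDataset and its fileName parameter are placeholders for illustration:

{
  "activities": [
    {
      "name": "Get List of Sink Files",
      "type": "GetMetadata",
      "typeProperties": {
        "dataset": { "referenceName": "SinkFolderDataset", "type": "DatasetReference" },
        "fieldList": [ "childItems" ]
      }
    },
    {
      "name": "Get List Of Source Files",
      "type": "GetMetadata",
      "typeProperties": {
        "dataset": { "referenceName": "SourceFolderDataset", "type": "DatasetReference" },
        "fieldList": [ "childItems" ]
      }
    },
    {
      "name": "Filter Deleted Files",
      "type": "Filter",
      "dependsOn": [
        { "activity": "Get List of Sink Files", "dependencyConditions": [ "Succeeded" ] },
        { "activity": "Get List Of Source Files", "dependencyConditions": [ "Succeeded" ] }
      ],
      "typeProperties": {
        "items": { "value": "@activity('Get List of Sink Files').output.childItems", "type": "Expression" },
        "condition": { "value": "@not(contains(activity('Get List Of Source Files').output.childItems, item()))", "type": "Expression" }
      }
    },
    {
      "name": "Delete Removed Files",
      "type": "ForEach",
      "dependsOn": [ { "activity": "Filter Deleted Files", "dependencyConditions": [ "Succeeded" ] } ],
      "typeProperties": {
        "items": { "value": "@activity('Filter Deleted Files').output.Value", "type": "Expression" },
        "activities": [
          {
            "name": "Delete File",
            "type": "Delete",
            "typeProperties": {
              "dataset": {
                "referenceName": "SinkFileDataset",
                "type": "DatasetReference",
                "parameters": { "fileName": "@item().name" }
              },
              "enableLogging": false
            }
          }
        ]
      }
    }
  ]
}

The Filter condition compares whole childItems entries (objects with name and type), which works here because both Get Metadata activities return the same shape.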
My blog post below shows a sample for a similar scenario; in your case the changes would be to reverse source and sink, and to use a Delete activity rather than a Copy activity:
https://datasharkx.wordpress.com/2023/04/18/moving-delta-missing-files-between-source-and-sink-via-azure-data-factory-synapse-pipeline/
You can get all the file paths in all the subfolders using an ADF data flow. Since you mention that your source and target are Blob Storage, you can try a data flow for this.
Create a CSV (DelimitedText) dataset configured as follows: give the path up to the root container of the source folder, disable First row as header, and set the column delimiter to a character that does not occur in your files.
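A sketch of such a dataset in JSON, assuming a Blob Storage linked service named AzureBlobStorageLS, a source root container named sourcecontainer, and | standing in for whatever delimiter never occurs in your data:

{
  "name": "SourceFilesDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "AzureBlobStorageLS", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": { "type": "AzureBlobStorageLocation", "container": "sourcecontainer" },
      "columnDelimiter": "|",
      "firstRowAsHeader": false
    }
  }
}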
Give this dataset to the data flow source. In the source settings, set the wildcard file path to */** and give a column name for the file path (the Column to store file name option). The source will union the rows of all the files and add another column, filepath, which holds each row's file path. As we have given an unused column delimiter and disabled First row as header in the dataset, the data will have only one file column (col_0) plus the filepath column added by the data flow.
To get all the unique file paths, use an aggregate transformation with a group-by on the filepath column, and in the aggregates section give a sample column with the expression count(filepath). This returns one row per distinct file path.
Then use a select transformation to remove the extra count column. After that, add a sink and select Cache as the sink type.
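Put together, the data flow script for this source -> aggregate -> select -> cache-sink chain would look roughly like the sketch below (transformation names such as source1 and sinkCache are illustrative, and the script ADF generates for you may differ in detail):

source(allowSchemaDrift: true,
    validateSchema: false,
    wildcardPaths:['*/**'],
    rowUrlColumn: 'filepath') ~> source1
source1 aggregate(groupBy(filepath),
    count = count(filepath)) ~> aggregate1
aggregate1 select(mapColumn(filepath),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> select1
select1 sink(validateSchema: false,
    allowSchemaDrift: true,
    store: 'cache',
    format: 'inline',
    output: false,
    saveOrder: 1) ~> sinkCache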
Now, in the pipeline, take a Data flow activity and add this data flow. In the activity settings, uncheck First row only.
Upon executing the pipeline, the dataflow output JSON will contain the required file paths as an array.
You can get this array in the pipeline using an expression on the data flow activity's output.
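The exact expression depends on your activity and sink names; assuming the Data flow activity is named Get Source File Paths and its cache sink is named sinkCache, it would be:

@activity('Get Source File Paths').output.runStatus.output.sinkCache.value

With First row only unchecked, .value holds the full array of rows.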
Do the same with the target root container using another data flow activity. Once you have both arrays, compare them to get the deleted file paths. Later you can delete those files with a Delete activity, as suggested by @Nandan.
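A Filter activity can do the comparison, following the same pattern as the first answer; a sketch assuming the two Data flow activities are named Get Sink File Paths and Get Source File Paths, each with a cache sink named sinkCache:

Items: @activity('Get Sink File Paths').output.runStatus.output.sinkCache.value
Condition: @not(contains(activity('Get Source File Paths').output.runStatus.output.sinkCache.value, item()))

Each item is a row object such as { "filepath": "subfolder/file1.csv" }, so the filter output is the list of files that exist only in the target; you can then feed @item().filepath into a parameterized Delete dataset inside a ForEach.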