When copying files from S3 to Azure Blob Storage, I would like to add a date and time string in addition to the source file name.
In essence, the S3 folder structure looks like this:
data/yyyy/mm/dd/files
*yyyy=2019-2022, mm=01-12, dd=01-31
When copying these to Blob Storage, I want to store them in the following folder structure:
data/year=yyyy/month=mm/day=dd/files
Attached is a picture of the folder structure of the S3 bucket and the folder structure we want to achieve with Blob Storage.
I have been renaming the photo folders in Blob Storage manually, but there are thousands of files and it takes too long, so I want to do it automatically.
Do I use the "GetMetadata" or "ForEach" activity?
Or use dynamic parameters in the "Copy" activity to set up a sink dataset?
Also, I am not an experienced data engineer and am not familiar with Synapse, so I am not sure how to approach this.
Any help would be appreciated.
Thanks.
2 Answers
You will first need the file names (use the Get Metadata activity). Then, for each file name, append the date and time string using functions like concat(). You can also create a variable, NewFileName, and pass it as a parameter to the Copy activity. The copy source will then have the original file name and the sink will have the new file name. The Copy activity will be parameterized, since you are passing the file name dynamically.
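As a minimal sketch (the variable name NewFileName and the timestamp format are only assumptions, adjust them to your needs), the dynamic content for the variable inside the ForEach could look like:
NewFileName:
@concat(item().name, '_', formatDateTime(utcNow(), 'yyyyMMddHHmmss'))
The sink dataset's file name parameter can then be set to @variables('NewFileName').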
Hope this helps.
Using the Get Metadata activity, ForEach activity, and Execute Pipeline activity, get the nested folder structure from the source dataset. Pass the extracted folder structure to the sink dataset dynamically by adding the required string value to the folder structure.
Pipeline1:
Using a Get Metadata activity, get the child items under the container (data/).
Pass the child items to a ForEach activity to loop over each year folder.
Inside the ForEach activity, add an Execute Pipeline activity. Create a new pipeline (pipeline2) with 2 parameters in it to hold the source and sink folder structure, and pass the pipeline2 parameter values from pipeline1 (a worked example follows the expressions below):
SubFolder1:
@item().name
Sink_dir1:
@concat('year=',item().name)
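For example (assuming the source dataset's directory parameter is appended to the data/ path), if the current child item is the folder 2021, pipeline2 receives:
SubFolder1: 2021
Sink_dir1: year=2021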
Pipeline2:
Pipeline2 parameters: SubFolder1 and sink_dir1, received from pipeline1.
Get Metadata activity: get the child items (the month folders) from the parameterized source dataset.
Dataset property – dir:
@pipeline().parameters.SubFolder1
Pass the child items to a ForEach activity; inside it, add an Execute Pipeline activity that calls pipeline3 with the following parameter values (a worked example follows below):
SubFolder2:
@concat(pipeline().parameters.SubFolder1,'/',item().name)
sink_dir2:
@concat(pipeline().parameters.sink_dir1,'/month=',item().name)
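Continuing the example, if the current child item under 2021 is 05, pipeline3 receives:
SubFolder2: 2021/05
sink_dir2: year=2021/month=05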
Pipeline3:
Using another Get Metadata activity, get the child items (the day folders) from the parameterized source dataset.
Dataset property – dir:
@pipeline().parameters.SubFolder2
Pass the child items to a ForEach activity. Inside the ForEach activity, add a Copy data activity to copy the files from source to sink.
Connect the source to the source dataset and pass the directory parameter dynamically by concatenating the parameter value and the current child item.
dir:
@concat(pipeline().parameters.SubFolder2,'/',item().name,'/')
Create a sink dataset with dataset parameters to pass the directory path dynamically.
In the sink, pass the directory path dynamically by concatenating the parameter value with the current child item path.
Sink_dir:
@concat(pipeline().parameters.sink_dir2,'/day=',item().name,'/')
Output structure: the copy creates the folder structure in the sink automatically if it does not already exist.
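To complete the example (assuming both datasets prepend the data/ path to the directory parameter), if the current child item under 2021/05 is 17, the Copy activity reads from data/2021/05/17/ and writes to data/year=2021/month=05/day=17/, so a hypothetical file data/2021/05/17/sample.jpg lands at data/year=2021/month=05/day=17/sample.jpg.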