
When copying files from S3 to Azure Blob Storage, I would like to add a date and time string in addition to the source file name.

In essence, the S3 folder structure looks like this:
data/yyyy/mm/dd/files
(yyyy = 2019-2022, mm = 01-12, dd = 01-31)

And when copying these to Blob Storage, I want to store them in the following folder structure:
data/year=yyyy/month=mm/day=dd/files
(e.g., data/2019/01/05/photo.jpg should end up as data/year=2019/month=01/day=05/photo.jpg)

Attached is a picture of the folder structure of the S3 bucket and the folder structure we want to achieve with Blob Storage.

I tried renaming the photo folders in Blob Storage manually, but there are thousands of files and it takes too long, so I want to do it automatically.

Should I use the "Get Metadata" or "ForEach" activity?
Or should I use dynamic parameters in the "Copy" activity to set up the sink dataset?

Also, I am not an experienced data engineer and am not familiar with Synapse, so I am not sure how to approach this.

Any help would be appreciated.
Thanks.

2 Answers


  1. You will first need the file names (use a Get Metadata activity). Then, for each file name, append the date and time string using functions like concat(). You can also create a variable, e.g. NewFileName, and pass it as a parameter to the Copy activity. The copy source will then have the original file name and the sink will have the new one. The Copy activity is parameterized because you pass the file name dynamically.
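
    As a rough sketch, the NewFileName variable could be set with an expression like this inside the ForEach (assuming the current file name comes from item().name; concat(), formatDateTime(), and utcnow() are built-in expression functions):

      @concat(item().name, '_', formatDateTime(utcnow(), 'yyyyMMddHHmmss'))

    If the files have extensions, you would split the name around the extension first so the timestamp lands before it.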

    Hope this helps.

  2. Using the Get Metadata, ForEach, and Execute Pipeline activities, walk the nested folder structure of the source dataset. Pass the extracted folder structure to the sink dataset dynamically, adding the required string values (year=, month=, day=) to the folder names. The overall layout is sketched below.
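
    The three pipelines nest like this:

      pipeline1: Get Metadata (yyyy folders) -> ForEach -> Execute pipeline2
      pipeline2: Get Metadata (mm folders)   -> ForEach -> Execute pipeline3
      pipeline3: Get Metadata (dd folders)   -> ForEach -> Copy data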

    1. Create a source dataset with a dataset parameter for the directory.

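    In the dataset's file path, the directory field can then reference this parameter as dynamic content (the parameter is named dir here, matching the Get Metadata steps below):

      @dataset().dir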

    Pipeline1:

    1. Using the Get Metadata activity, get the child items under the container (data/).


    2. Pass the child items to the ForEach activity to loop over each folder.

      @activity('get sub folder list_yyyy').output.childItems
      

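    The childItems output is a list of name/type pairs, one entry per year folder, for example:

      [
        {"name": "2019", "type": "Folder"},
        {"name": "2020", "type": "Folder"}
      ]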

    3. Inside the ForEach activity, add an Execute Pipeline activity. Create a new pipeline (pipeline2) with two parameters to hold the source and sink folder structures, and pass their values from pipeline1.

    SubFolder1: @item().name
    sink_dir1: @concat('year=',item().name)


    Pipeline2:

    1. In pipeline2, repeat the same process as in pipeline1: using a Get Metadata activity, get the child items under the year folder (yyyy) and pass them to a ForEach activity.

    Pipeline2 parameters:


    Get Metadata:

    Dataset property – dir: @pipeline().parameters.SubFolder1


    2. Inside the ForEach activity, add an Execute Pipeline activity to pass the current item to a nested pipeline (pipeline3). Create two parameters in pipeline3 to hold the source and sink structures.

    SubFolder2: @concat(pipeline().parameters.SubFolder1,'/',item().name)

    sink_dir2: @concat(pipeline().parameters.sink_dir1,'/month=',item().name)


    Pipeline3:

    1. Using the Get Metadata activity, get the child items under the source folder structure.

    Dataset property – dir: @pipeline().parameters.SubFolder2


    2. Pass the child items to a ForEach activity. Inside it, add a Copy data activity to copy the files from source to sink.

    3. Connect the source to the source dataset and pass the directory parameter dynamically by concatenating the parameter value and the current child item.

    dir: @concat(pipeline().parameters.SubFolder2,'/',item().name,'/')


    4. Create a sink dataset with a dataset parameter to pass the directory path dynamically.

    5. In the sink, pass the directory path dynamically by concatenating the parameter value with the current child item path.

    Sink_dir: @concat(pipeline().parameters.sink_dir2,'/day=',item().name,'/')

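    Plugging one example folder through the expressions above, the values at each level are:

      pipeline1:  SubFolder1 = 2019          sink_dir1 = year=2019
      pipeline2:  SubFolder2 = 2019/01       sink_dir2 = year=2019/month=01
      pipeline3:  source dir = 2019/01/05/   sink dir  = year=2019/month=01/day=05/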

    Output structure: the copy creates the folder structure in the sink automatically if it does not already exist.

