
I am loading .gzip files as binary into my raw container, and I am now wondering how to proceed in Azure Synapse Analytics. I would like to take the binary .gzip, move it to a different folder, and store it in Parquet format, using the following steps.

  1. Transform the .gzip files to JSON format
  2. Transform the JSON files to Parquet

I am new to pipelines and not sure when to use Copy Data vs. Data Flow, etc.

If someone could show the steps with screenshots, or explain them very clearly, it would be highly appreciated!

Thanks,
Anders

2 Answers


  1. As per the official Microsoft documentation:

    The Binary dataset can only be used in the Copy, GetMetadata, or Delete activity; it cannot be used in a Data Flow activity. Also, when using a Binary dataset, the service does not parse the file content but treats it as-is.

    So even if you use a Binary dataset in a Copy activity, you cannot transform it into another format; you can only copy from a Binary dataset to another Binary dataset.

    Therefore, you need to change your approach and use a programmatic method for your use case; a minimal sketch follows.
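
    One programmatic option, if the .gz files contain newline-delimited JSON, is a short pandas script. This is only a rough sketch: the file paths and the pyarrow dependency are assumptions, not details from the original question.

    ```python
    import pandas as pd

    # pandas decompresses gzip transparently; lines=True assumes newline-delimited JSON.
    # The paths below are placeholders for the raw and curated containers.
    df = pd.read_json("raw/events.json.gz", lines=True, compression="gzip")

    # Write the same records back out as Parquet (requires pyarrow or fastparquet).
    df.to_parquet("curated/events.parquet", index=False)
    ```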

  2. This is a common pattern we use, especially for larger ZIP files from SFTP, which can take hours to download.

    1. First, as you have already done, use a Binary Dataset to load the zip file to your raw container.
    2. Next create a Delimited Dataset to define the delimiter, quotes, header, etc., to read the raw container file. In this Dataset, define the Compression type as "gzip". When used as a Source, Data Factory will unzip/decompress the data on read. [Some notes: defining a schema is optional, and unnecessary if you are merely transforming formats; you can also use this Dataset as a Sink, and Data Factory will GZ/Compress the data on write; if your files are .zip rather than .gz, use ZipDeflate to accomplish the same tasks.]
    3. Finally, use either a Copy activity or a Dataflow to transform the data to the desired Sink definition. If you truly wish to produce both JSON and Parquet, you can do both in a single Dataflow by branching the source (a Spark equivalent is sketched below).
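
    If you run this in a Synapse Spark notebook instead of a Dataflow, the branched write looks roughly like the sketch below. It assumes the raw files are gzipped CSV with a header row, that `spark` is the session Synapse provides in a notebook, and the abfss:// paths and storage account name are placeholders.

    ```python
    # Spark decompresses .gz files transparently when reading.
    df = (
        spark.read
        .option("header", "true")
        .csv("abfss://raw@<storageaccount>.dfs.core.windows.net/incoming/*.csv.gz")
    )

    # "Branch" the single source into two sinks, as the Dataflow would.
    df.write.mode("overwrite").json("abfss://curated@<storageaccount>.dfs.core.windows.net/json/")
    df.write.mode("overwrite").parquet("abfss://curated@<storageaccount>.dfs.core.windows.net/parquet/")
    ```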