
Within Azure Data Factory I have a table with two columns. I want to remove all duplicate rows and keep only one distinct row for each combination of values.

Currently:

columnA, columnB
a1,b1
a1,b1
a2,b2
a3,b3
a3,b3
a3,b4

After the transformation:

columnA, columnB
a1,b1    
a2,b2
a3,b3
a3,b4

What I did so far: I created a data flow with an Aggregate transformation, grouped by columnA, and tried to build an expression pattern that removes every non-distinct row.
But it doesn't work. It seems that checking "distinctness" across both columns is not easily done this way. Any ideas?

2 Answers


  1. To cleanse a file of duplicate records via Azure Data Factory / Synapse, we can use a concept similar to SQL's ROW_NUMBER with PARTITION BY / OVER.

    The flow would be:

    Source >> Aggregate transformation >> Filter transformation >> Sink

    Sample:
    https://www.c-sharpcorner.com/article/cleansing-files-of-duplicate-records-via-azure-data-factory-synapse/
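
    As a rough sketch (not taken from the linked article), the closest mapping-data-flow equivalent of ROW_NUMBER() OVER (PARTITION BY ...) is a Window transformation that computes rowNumber() per group of duplicate-defining columns, followed by a Filter that keeps only row 1. In Data Flow Script it could look something like the following; all stream and column names (sourceData, numberRows, rowNum, and so on) are placeholders, and the article itself describes an Aggregate + Filter pair rather than a Window:

        source(output(
                columnA as string,
                columnB as string
            ),
            allowSchemaDrift: true,
            validateSchema: false) ~> sourceData
        sourceData window(over(columnA, columnB),
            asc(columnA, true),
            rowNum = rowNumber()) ~> numberRows
        numberRows filter(rowNum == 1) ~> keepFirst
        keepFirst select(mapColumn(
                columnA,
                columnB
            ),
            skipDuplicateMapInputs: true,
            skipDuplicateMapOutputs: true) ~> dropRowNum
        dropRowNum sink(allowSchemaDrift: true,
            validateSchema: false) ~> sinkData

    Either way, the key point is that every column that defines a duplicate must take part in the partitioning (or grouping), not just columnA.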

  2. Follow the approach below to achieve your requirement. Note that this approach won't preserve the source row order in the target.

    First, in the Aggregate transformation's group-by section, use sha2(256, columns()) and give this new column any name.


    Now, in the aggregates section, add a column pattern that uses first($$), as shown in the script sketch below.

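    In Data Flow Script terms, that Aggregate transformation looks roughly like this (source1, DistinctRows, and the column name mycols are placeholders; mycols is whatever name you gave the hash column):

        source1 aggregate(groupBy(mycols = sha2(256, columns())),
            each(match(true()), $$ = first($$))) ~> DistinctRows

    Grouping on a SHA-256 hash of all columns puts rows whose values match in every column into the same group, and the column pattern first($$) keeps each original column's value from the first row of that group.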

    This gives the distinct rows, along with the extra hash column created by the group-by.


    Use rule-based mapping in a Select transformation to remove the extra column, as sketched below.

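    In script form, a Select transformation with that rule-based mapping could look roughly like this (the exact script the UI generates may differ slightly; mycols is again the hash column's name):

        DistinctRows select(mapColumn(
                each(match(name != 'mycols'))
            ),
            skipDuplicateMapInputs: true,
            skipDuplicateMapOutputs: true) ~> RemoveHashColumn

    The rule keeps every column whose name is not 'mycols', so only the helper hash column is dropped.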

    Now the sink receives only the desired distinct rows.

