Within Azure Data Factory I have a table with 2 columns. I want to remove all duplicate rows and keep only one distinct row per duplicate group.
Currently:
columnA, columnB
a1,b1
a1,b1
a2,b2
a3,b3
a3,b3
a3,b4
After the transformation:
columnA, columnB
a1,b1
a2,b2
a3,b3
a3,b4
What I did so far: I created a data flow with an aggregate transformation, grouped by columnA, and tried to build an expression pattern that drops every non-distinct row.
But it doesn't work :/ It seems that checking "distinctness" across both columns is not easily done. Any ideas?
2 Answers
To cleanse a file of duplicate records via Azure Data Factory / Synapse, we can leverage a concept similar to SQL's ROW_NUMBER() with PARTITION BY / OVER.
The flow would be:
Source >> Aggregate transformation >> Filter transformation >> Sink
Sample:
https://www.c-sharpcorner.com/article/cleansing-files-of-duplicate-records-via-azure-data-factory-synapse/
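As a rough sketch of that shape in data flow script, assuming two string columns and illustrative names for the streams and the `rowCount` aggregate (none of this is taken from the linked article):

```
source(output(
        columnA as string,
        columnB as string
    ),
    allowSchemaDrift: true,
    validateSchema: false) ~> source1
source1 aggregate(groupBy(columnA, columnB),
    rowCount = count()) ~> GroupRows
GroupRows filter(rowCount >= 1) ~> DistinctRows
DistinctRows sink(allowSchemaDrift: true,
    validateSchema: false) ~> sink1
```

Grouping on both columns already collapses each duplicate group into one row; the filter is where you would add any extra condition (for example `rowCount > 1` to look only at keys that actually had duplicates), and a select transformation can drop `rowCount` before the sink if you don't want it in the output.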
Follow the approach below to achieve your requirement, but note that it won't preserve the source row order in the target.
First use `sha2(256,columns())` in the group-by section of the aggregate transformation and give this column any name (the original answer called it `columns`). Then, in the aggregates section, use a column pattern with `first($$)`, as shown in the sketch below. This returns the distinct rows plus the extra grouping column. Use rule-based mapping in a select transformation to remove that extra column.
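A minimal data flow script sketch of those two transformations, assuming the incoming stream is named `source1`; the grouping column is named `colsHash` here (rather than `columns`) simply to avoid confusion with the `columns()` function, and the transformation names are illustrative:

```
source1 aggregate(groupBy(colsHash = sha2(256, columns())),
    each(match(true()), $$ = first($$))) ~> DistinctRows
DistinctRows select(mapColumn(
        each(match(name != 'colsHash'))
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> RemoveHashColumn
```

The column pattern `each(match(true()), $$ = first($$))` keeps the first value of every original column within each hash group, and the `name != 'colsHash'` rule in the select maps every column except the hash to the output.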
Now it gives the desired result (the four distinct rows shown in the question).