Within Azure Data Factory I have a table with 2 columns. I want to remove all duplicate rows and keep only one distinct row per duplicate group.
Currently:
columnA, columnB
a1,b1
a1,b1
a2,b2
a3,b3
a3,b3
a3,b4
After the transformation:
columnA, columnB
a1,b1
a2,b2
a3,b3
a3,b4
What I did so far: I created a data flow with an aggregate transformation, grouped by columnA, and tried to build an expression pattern that drops every non-distinct row.
But it doesn't work :/ It seems that checking "distinctness" across both columns is not easily done. Any ideas?
2 Answers
To cleanse a file of duplicate records via Azure Data Factory / Synapse, we can leverage a concept similar to SQL's ROW_NUMBER() with PARTITION BY / OVER.
The flow would be:
Source >> Aggregate transformation >> Filter transformation >> Sink
Sample:
https://www.c-sharpcorner.com/article/cleansing-files-of-duplicate-records-via-azure-data-factory-synapse/
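As a rough sketch of that shape in data flow script, assuming two string columns and illustrative names for the streams and the `rowCount` aggregate (none of this is taken from the linked article):

```
source(output(
        columnA as string,
        columnB as string
    ),
    allowSchemaDrift: true,
    validateSchema: false) ~> source1
source1 aggregate(groupBy(columnA, columnB),
    rowCount = count()) ~> GroupRows
GroupRows filter(rowCount >= 1) ~> DistinctRows
DistinctRows sink(allowSchemaDrift: true,
    validateSchema: false) ~> sink1
```

Grouping on both columns already collapses each duplicate group into one row; the filter is where you would add any extra condition (for example `rowCount > 1` to look only at keys that actually had duplicates), and a select transformation can drop `rowCount` before the sink if you don't want it in the output.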
Follow the approach below to achieve your requirement, but note that it won't preserve the source row order in the target.
First use `sha2(256,columns())` in the group-by section of the aggregate transformation and give this column any name (the original answer called it `columns`). Then, in the aggregates section, use a column pattern with `first($$)`, as shown in the sketch below. This returns the distinct rows plus the extra grouping column. Use rule-based mapping in a select transformation to remove that extra column.
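A minimal data flow script sketch of those two transformations, assuming the incoming stream is named `source1`; the grouping column is named `colsHash` here (rather than `columns`) simply to avoid confusion with the `columns()` function, and the transformation names are illustrative:

```
source1 aggregate(groupBy(colsHash = sha2(256, columns())),
    each(match(true()), $$ = first($$))) ~> DistinctRows
DistinctRows select(mapColumn(
        each(match(name != 'colsHash'))
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> RemoveHashColumn
```

The column pattern `each(match(true()), $$ = first($$))` keeps the first value of every original column within each hash group, and the `name != 'colsHash'` rule in the select maps every column except the hash to the output.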
Now it gives the desired result (the four distinct rows shown in the question).