I am working on ADF data flow and I am trying to implement a logic for two step sorting in my source data. The requirement is like, I have 1 primary key and 2 date columns :
example- id, date1 and date2
Requirement-
(i) if there is duplicate data with same id in source file then the row which has maximum date1 should be picked else,
(ii) if there is duplicate data with same id and if the date1 is also same for them then the row which has maximum date2 should be picked and sent to output.
I tried giving two sorting one after another in Aggregate stage but currently I am getting random values from both duplicate rows which is wrong.
Can anyone help me to get this requirement? Thank you
2
Answers
Use a Sort transformation to sort the data based on the first condition.
Connect the Sort transformation to a Conditional Split transformation.
In order to get the maximum date1 for the same ids and maximum date2 for same id,date1 combination, you have to first get the max date2 value for the same id and date1. Then check for the maximum date1 for id column. Below is the detailed approach.
Sample input
| id | date1 | date2 |
|—-|————|————-|
| 1 | 2023-01-01 | 2023-01-03 |
| 1 | 2023-01-02 | 2023-01-02 |
| 2 | 2023-01-02 | 2023-01-01 |
| 2 | 2023-01-02 | 2023-01-02 |
The sample input that is taken has three columns: id, date1, and date2.
Use an Aggregate transformation to group the data by the id and date1 columns and calculate the maximum value of date2 for each group. This will ensure that for each id and date1 combination, you get the maximum value of date2. You can use the following expression in the Aggregate transformation:
The output of this transformation will have three columns: id, date1, and date2 (max value of date2).
Use another Aggregate transformation to group the data by the id column and calculate the maximum values of date1 for each id group. This will ensure that for each id, you get the maximum values of date1. You can use the following expression in the Aggregate transformation:
This will ensure that you get the rows that satisfy your two-step sorting logic.