There are common ways to handle dup data as described Here
But I’m looking for more options.
Thanks
I have tried the options from here
https://learn.microsoft.com/en-us/azure/data-explorer/dealing-with-duplicates
and was looking into unique hashed row approach or a policy which triggers the function and periodic KQL queries to identify and remove the duplicate data.
2
Answers
Another approach to handle duplicate data in ADX is by using
distinct
. It removes the duplicate data from the table.Below is the sample table
TableF
is created with duplicate data as shown below.Use below query with
distinct
to work with duplicate values. It stores the data in another tablenewTable
Output:
My experience is that for large datasets (>100.000.000.000) Materialized Views are the best approach. They are providing very good performance, operations is very stable, it is traceable and deterministic. Plus, they can be controlled in a flexible way by using the view’s policies. Materialized Views are described in the documentation that is mentioned above.
It is also possible to use an update policy function which contains a distinct * statement. If you use that my recommendation is to add a "lookback" filter. Example:
Alternatively, you could use a retention policy with 0D on the raw table.
But to my experience the function approach is outperformed and less stable than using Materialized Views.