skip to Main Content

There are common ways to handle dup data as described Here

But I’m looking for more options.
Thanks

I have tried the options from here
https://learn.microsoft.com/en-us/azure/data-explorer/dealing-with-duplicates
and was looking into unique hashed row approach or a policy which triggers the function and periodic KQL queries to identify and remove the duplicate data.

2

Answers


  1. But I’m looking for more options

    Another approach to handle duplicate data in ADX is by using distinct. It removes the duplicate data from the table.

    Below is the sample table TableF is created with duplicate data as shown below.
    enter image description here

    Use below query with distinct to work with duplicate values. It stores the data in another table newTable

    let newTable = TableF
    | distinct *;
    newTable
    

    Output:
    enter image description here

    Login or Signup to reply.
  2. My experience is that for large datasets (>100.000.000.000) Materialized Views are the best approach. They are providing very good performance, operations is very stable, it is traceable and deterministic. Plus, they can be controlled in a flexible way by using the view’s policies. Materialized Views are described in the documentation that is mentioned above.

    It is also possible to use an update policy function which contains a distinct * statement. If you use that my recommendation is to add a "lookback" filter. Example:

    .create function DeduplicateData(lookback:datetime )
    {
    let newTable = TableF
    | where mytimestamp > ago(7d)
    | distinct *;
    newTable
    }
    

    Alternatively, you could use a retention policy with 0D on the raw table.

    But to my experience the function approach is outperformed and less stable than using Materialized Views.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search