Azure - What are the possible ways I can handle duplicate data in ADX - PhpOut

NIKHILAV
April 4, 2024
83 views
0 votes
2 Answers

There are common ways to handle dup data as described Here

But I’m looking for more options.
Thanks

I have tried the options from here
https://learn.microsoft.com/en-us/azure/data-explorer/dealing-with-duplicates
and was looking into unique hashed row approach or a policy which triggers the function and periodic KQL queries to identify and remove the duplicate data.

Tags: adx azure azure-data-explorer kql

Answers

- BalaJi
- March 28, 2024 at 1:07 pm
- 0 votes
0
But I’m looking for more options

Another approach to handle duplicate data in ADX is by using distinct. It removes the duplicate data from the table.

Below is the sample table TableF is created with duplicate data as shown below.

Use below query with distinct to work with duplicate values. It stores the data in another table newTable
```
let newTable = TableF
| distinct *;
newTable
```
Output:
Login or Signup to reply.

- werner
- April 4, 2024 at 12:49 pm
- 0 votes
0
My experience is that for large datasets (>100.000.000.000) Materialized Views are the best approach. They are providing very good performance, operations is very stable, it is traceable and deterministic. Plus, they can be controlled in a flexible way by using the view’s policies. Materialized Views are described in the documentation that is mentioned above.

It is also possible to use an update policy function which contains a distinct * statement. If you use that my recommendation is to add a "lookback" filter. Example:
```
.create function DeduplicateData(lookback:datetime )
{
let newTable = TableF
| where mytimestamp > ago(7d)
| distinct *;
newTable
}
```
Alternatively, you could use a retention policy with 0D on the raw table.

But to my experience the function approach is outperformed and less stable than using Materialized Views.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.