SQL - Presto: Create a new table that counts values without duplicates having issues (AWS Athena) - Amazon web services

TheDan
October 6, 2022
305 views
1 vote
2 Answers

I have a response_data_v0 that contains several duplicates of the UUID values that I want to ignore, keeping just the value that contains the oldest date (In other words, I only want the row with the first appearance of that specific UUID)

I built a temporary table using "with" and I filter it by getting "min(uuid)" as a column just to get unique values, then I created a second one that counts those values.

Given that the values of the UUID must be unique (following my logic), I created a validation column "excess_data" to test my hypotheses. All the values of "excess_data" should be = 0 if I am not getting duplicates in the first table given that

count(uuid) = count(distinct uuid)

In this specific case.

BUT, that is not happening, "excess_data" > 0 in all my results.

What am I doing wrong??

with unique_values as (SELECT 
    min(uuid) as uuid,
    url,
    day,
    month,
    year
    --response_data
FROM "data_lake"."response_data_v0" 
group by url, day, month, year
--order by uuid
)

SELECT 
    count(uuid) as count_uuids, 
    count(distinct uuid) as count_unique_uuids, 
    count(uuid) - count(distinct uuid) as excess_data,
    month,
    year
FROM unique_values
group by year, month
order by year, month

Answers

- GuruStron
- October 7, 2022 at 1:23 am
- 0 votes
0
I would argue that the problem here is the different group clauses used to filter out duplicates and to verify that filtering. Unless business logic of the app creating the data prevents the same uuid from appearing on different days the observed behavior pretty much is bound to happen – unique_values grouping clause includes day, while the final select – does not. Either add day to the result query group by clause or remove it from unique_values‘s.

Login or Signup to reply.

- theo
- October 8, 2022 at 11:15 am
- 0 votes
0
PrestoDB/Trino have a min_by aggregate function that returns the value of a column associated with the minimum value of another column (of the group). min_by(uuid, my_date) would return the value of the uuid column of the row with the minimum value of my_date in the group. Just assemble a string from your year, month, day columns and use that in place of my_date.

Login or Signup to reply.