I have a response_data_v0 that contains several duplicates of the UUID values that I want to ignore, keeping just the value that contains the oldest date (In other words, I only want the row with the first appearance of that specific UUID)
I built a temporary table using "with" and I filter it by getting "min(uuid)" as a column just to get unique values, then I created a second one that counts those values.
Given that the values of the UUID must be unique (following my logic), I created a validation column "excess_data" to test my hypotheses. All the values of "excess_data" should be = 0
if I am not getting duplicates in the first table given that
count(uuid) = count(distinct uuid)
In this specific case.
BUT, that is not happening, "excess_data" > 0
in all my results.
What am I doing wrong??
with unique_values as (SELECT
min(uuid) as uuid,
url,
day,
month,
year
--response_data
FROM "data_lake"."response_data_v0"
group by url, day, month, year
--order by uuid
)
SELECT
count(uuid) as count_uuids,
count(distinct uuid) as count_unique_uuids,
count(uuid) - count(distinct uuid) as excess_data,
month,
year
FROM unique_values
group by year, month
order by year, month
2
Answers
I would argue that the problem here is the different group clauses used to filter out duplicates and to verify that filtering. Unless business logic of the app creating the data prevents the same
uuid
from appearing on different days the observed behavior pretty much is bound to happen –unique_values
grouping clause includesday
, while the final select – does not. Either addday
to the result query group by clause or remove it fromunique_values
‘s.PrestoDB/Trino have a
min_by
aggregate function that returns the value of a column associated with the minimum value of another column (of the group).min_by(uuid, my_date)
would return the value of theuuid
column of the row with the minimum value ofmy_date
in the group. Just assemble a string from your year, month, day columns and use that in place ofmy_date
.