I hope this is not a duped question. I’ve looking around and can’t find any question like mine, but…
I have a table containing people’s data, and we know there are duplicates. The table contains more than 30k entries and is live growing. The expected max volume is around 100k.
I’m running a process to find out the duplicated entries, BUT, and here is the BUT… the duplicates are "similar" entries, not exact duplicated entries.
The criteria is as follows: we consider it duplicate if they belong to the same group and have a similar name.
Same group is just a join with another table, properly indexed, etc. That part works fine.
Similar name means equal name ignoring case, trimmed and ignoring internal spaces. So " joHn sMith" (initial blank space) is equal to "JOHN SMITH " (intermediate and final spaces) and equal to " john smith " (initial, intermediate and final spaces).
My initial strategy is super slow: I have a view and one stored parsing the view. The stored removes the duped entries one by one, as the duped removal is complex and affects several other tables.
The stored works more or less efficient, but the view is horrible.
The view joins the table with the people to itself and the plan shows that, of course, the problem is in the full scan on the people’s table for each comparison:
strcmp(replace(lcase(trim(s
.name
)),’ ‘,”),replace(lcase(trim(t
.name
)),’ ‘,”)) = 0
Note that s. and t. refer to the same people’s table.
Any suggestions?
Much appreciated.
2
Answers
What I would do is this:
(It is debatable whether you should keep the original column with the non-canonical form of the names.)
In rough terms, the complexity is as follows:
I think …
It is not clear from your question whether a person is a direct child of a single group, or a member of multiple groups. Assuming the former, I would just:
With your projected max row count of only 100k rows, this should be much faster than your non-sargable self-join. Testing with 0.01% dupe rate on a random (ish) set of 300k rows, it returned 34 rows for merging/replacement in ~70ms.