
I hope this is not a duplicate question. I’ve looked around and can’t find any question like mine, but…

I have a table containing people’s data, and we know there are duplicates. The table contains more than 30k entries and is still growing. The expected maximum volume is around 100k.
I’m running a process to find the duplicated entries, BUT, and here is the BUT… the duplicates are "similar" entries, not exact duplicates.

The criterion is as follows: we consider two entries duplicates if they belong to the same group and have similar names.

Same group is just a join with another table, properly indexed, etc. That part works fine.

Similar name means an equal name ignoring case, leading/trailing whitespace, and internal spaces. So " joHn sMith" (initial blank space) is equal to "JOHN SMITH " (intermediate and final spaces) and equal to " john smith " (initial, intermediate and final spaces).
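To make that concrete, here is a sketch of the normalization in MySQL-flavored SQL (the example literals are the ones above): lowercasing plus removing every space collapses all three variants to the same canonical string.

```sql
-- All three variants normalize to 'johnsmith'.
-- REPLACE(..., ' ', '') removes ALL spaces, so a separate TRIM is redundant.
SELECT
    REPLACE(LCASE(' joHn sMith'),  ' ', '') AS a,  -- 'johnsmith'
    REPLACE(LCASE('JOHN SMITH '),  ' ', '') AS b,  -- 'johnsmith'
    REPLACE(LCASE(' john smith '), ' ', '') AS c;  -- 'johnsmith'
```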

My initial strategy is super slow: I have a view and a stored procedure that parses the view. The stored procedure removes the duplicated entries one by one, as duplicate removal is complex and affects several other tables.

The stored procedure works more or less efficiently, but the view is horrible.

The view joins the people table to itself, and the plan shows that, of course, the problem is the full scan of the people table for each comparison:

strcmp(replace(lcase(trim(s.name)),' ',''),replace(lcase(trim(t.name)),' ','')) = 0

Note that s and t refer to the same people table.

Any suggestions?

Much appreciated.

2 Answers


  1. What I would do is this:

    1. Add a new column to hold the canonical name; e.g. the name converted to all-caps with extraneous spaces and other characters removed.
    2. Populate the column.
    3. Add an index for the column.
    4. Deduplicate by reading records in order on the canonical name and merging / otherwise dealing with duplicate records.

    (It is debatable whether you should keep the original column with the non-canonical form of the names.)
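    The four steps above might look roughly like this in MySQL syntax (the column and index names `canonical_name`, `group_id`, and `idx_people_canonical` are placeholders, not from the question):

    ```sql
    -- Step 1: add the canonical-name column (O(1) schema change).
    ALTER TABLE people ADD COLUMN canonical_name VARCHAR(255);

    -- Step 2: populate it with the normalized name (one pass, O(N)).
    UPDATE people
    SET canonical_name = REPLACE(LCASE(TRIM(name)), ' ', '');

    -- Step 3: index it, together with the group, so duplicates sort adjacently.
    CREATE INDEX idx_people_canonical ON people (group_id, canonical_name);

    -- Step 4: duplicates can now be found with an index-backed scan
    -- instead of a non-sargable self-join.
    SELECT group_id, canonical_name, COUNT(*) AS n
    FROM people
    GROUP BY group_id, canonical_name
    HAVING COUNT(*) > 1;
    ```

    The actual merging of each duplicate set would still go through the existing stored procedure, since it touches several other tables.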

    In rough terms, the complexity is as follows:

    1. Is O(1).
    2. Is O(N).
    3. Is O(N log N).
    4. Is O(N). That is O(1) for each record!

    I think …

  2. It is not clear from your question whether a person is a direct child of a single group, or a member of multiple groups. Assuming the former, I would just:

    WITH dupes AS (
        SELECT id, MIN(id) OVER w AS replace_with_id
        FROM people
        WINDOW w AS (PARTITION BY group_id, LCASE(REPLACE(name, ' ', '')))
    )
    SELECT *
    FROM dupes
    WHERE id <> replace_with_id;
    

    With your projected max row count of only 100k rows, this should be much faster than your non-sargable self-join. Testing with 0.01% dupe rate on a random (ish) set of 300k rows, it returned 34 rows for merging/replacement in ~70ms.
