I have a database of many thousands of companies. The issue I have is that some of these employers are duplicates – however – they don’t have exactly the same name (otherwise this would be quite easy to solve).
So we have companies like ‘Wine Ltd’, ‘wine’, ‘wine holdings ltd’, ‘wine limited’. These companies may not be the same – however, I want to create a table which shows all of these similar companies so I can make the decision myself (and don’t have to go through all of the records).
I am using PostgreSQL
I have already used a query which searches for the first word of a company
e.g
select * from company where name like ‘FIRSTWORD%’
But obviously this only helps me one employer at a time and this will take me many hours.
2
Answers
I’m not 100% sure if I understand your question correct. So if I’m on the wrong track, please add more details to your question and explain the exact logic to find out whether strings are "similar".
As the question currently stands and as I read your requirements, it seems like you think strings are similar if they start with the same string.
In your example, all of those "similar" companies start with the same 4 letters.
So let’s assume this can be used as general rule in the query you are looking for.
In this case, we can use
STRING_AGG
to build a comma-separated list of similar companies andGROUP BY
their first 4 letters. To find the first 4 letters, we can useLEFT
andLOWER
( to ignore their case).So the query could be something like this:
If we want furthermore to exclude such common strings that occur less than 2 times, less than 4 times or any other number, we can add a
HAVING
clause.Here the example to only fetch such "similar" companies where a "group" of companies includes at least 4 companies:
Here we can try out this idea with some sample data: db<>fiddle
It should be clear such a query will not work perfectly because for example a company "cold water" would not be matched with "water cold". To cover all such similarities will be extremely difficult if even possible.
You could probably use the pg_trgm extension.
You will need some way to mark a pairing as already evaluated and found to be actually different, otherwise you will just keep reviewing the same proposed pairings over and over again.