I have the task of matching two lists of names and addresses – represented in two database tables in PostgreSQL.
Address data can be strings like Otto-Johannsen-Straße 7 or even Otto-Johannsen-Str. 7 Wohnung oben which have to match to Otto-Johannsen-Str. 7. Names can look be things like Antje’s Hus which should match with some probability to Antje or Haus am Meer should match to Hotel Haus am Meer.
So it is about fuzzy matching – but soundex() and even levenshtein() won’t help too much, because it’s part’s of strings hat have to be taken into account. With levenshtein() for instance the best match for Abendsonne was "Undine" – but the better match for my data would have been *Hotel Abendsonne".
I envision to have some probability measure with matches – so my result should be a list of matches with probabilities ideally.
How should I approach this problem – which matching algorithms should I use?
And is this a task I would directly do in PostgreSQL – of is it maybe a better approach to use Python?
2
Answers
In my hands, the two examples you show are tied. Unless the
*
is part of the string and not a typo.However, the similarity method from pg_trgm gives results much more like what you probably want:
And unlike Levenshtein, can support indexes for fast look-up.
The main problem with the first way being that Levenshtein is very sensitive to the strings being unequal lengths, while pg_trgm is somewhat less sensitive to that.
The postgis extension contains some code for standardizing addresses, although I don’t have much experience with using it, or any at all at using it in other languages. It is probably worth a look.
Another approach would be to strip out noise words, like ‘Straße’ and ‘Hotel’, from both sides before doing the comparisons.
If you use a geocoder, you can translate each full-text addresses/name to a coordinate pair (or a set of N most likely coordinate pairs) and do your matching based on that. Internally, geocoders do exactly what you’re trying to do – understand all variations of how the same location can be addressed.
The TIGER geocoder included in PostGIS is US-specific, so it needs some tinkering to get it to work elsewhere. OpenStreetMap offers Nominatim you can self-host. There’s also an entire market of paid APIs: GCloud, AWS, Azure all have one but there are also entire marketplaces of smaller vendors which you can usually try out to some extent with a free tier.
If you decide to build your own, you can benchmark it against those. If you come out on top, consider contributing to the open ones.