Suggestions for Automatically Matching Data in MySQL

Arthur
December 12, 2022
2 Answers

I have limited access to a MySQL database: I can only see a view called customer, which contains id_customer, name, and location.

id_customer  name   location
1            Andy   Detro.it
2            Ben    CALiforNIA
3            Mark   uk
4            Niels  London123
5            Pierre Paris

There is also a table called location, which lists the cities and countries of the customer locations.

id_country country  id_city  city
1          US       1        Detroit
1          US       2        California
2          UK       3        London
2          UK       4        Manchester

I want to clean the customer data automatically whenever new data arrives in the database. By that I mean: if the raw location contains punctuation, digits, or a typo, it should be cleaned automatically. The cleaned location should then be looked up in the location table to find its id_city. If no city is similar enough, it should be looked up as a country to find its id_country, and if that fails too, the id should be 0. The result should go into a new table called customer location:

id_customer   id_city  status
1             1        Match
2             2        Match
3             2        Country
4             3        Match
5             0        Unknown

The status is a label: Match if the location was matched to a city, Country if it was matched to a country, and Unknown if nothing similar was found and the id is 0. Because the location can be either a city or a country, the status tells you which of the two the id refers to.

Can someone suggest how I should approach this project? I am trying to do it with Python in a Jupyter notebook, but will that be effective for this case? I am really new to this, so sorry if I have not given enough information, and thanks in advance.

Answers

- purplenet
- December 15, 2022 at 12:23 pm
- 0 votes
I’m not sure I have understood the question correctly.

[…] I mean if in the raw data there is punctuation or number or typo, it will automatically clean […]

You need some kind of validation step here. You cannot do that directly in the database; you have to handle it in your own logic before the rows are inserted.

In these cases, the best solution is to prepare a picklist (multiple choice) from which end users can choose the right values.

A free text input will always be error prone.

If a multiple-choice list is not applicable in your case, then you need to put a set of validation rules in place, and you need to think about how to prevent every possible issue.

Example, in your case:

You could use Regex to clean the input
```
import re

city = 'Detro.it'
# Remove every character that is neither a word character (\w) nor whitespace (\s)
cleanCity = re.sub(r'[^\w\s]', '', city)

print(cleanCity)  # Detroit
```
You will need to experiment with regexes in this way; for example, if you want to extract only letters, use [a-zA-Z]+.
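As a small sketch of that idea (the helper name `letters_only` is only illustrative), `re.findall` with `[a-zA-Z]+` collects the runs of letters and everything else is dropped. Note that this also drops spaces, so a multi-word name would need a space added to the character class.

```python
import re

def letters_only(raw: str) -> str:
    """Concatenate every run of ASCII letters found in `raw`."""
    return "".join(re.findall(r"[a-zA-Z]+", raw))

print(letters_only("Detro.it"))   # Detroit
print(letters_only("London123"))  # London
```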

To normalise the letter case of the input you could use str.title().
After that, the first character of each word is uppercase and all the others are converted to lowercase.
```
city = "caliFORnia"
cleanCity = city.title()

print(cleanCity)  # California
```
The final table can be produced with a MySQL query.
You need to JOIN the tables. (Here the only common field is the name of the city, which is not ideal for the ON clause; an id would be better.)

To build the derived column ‘Status’ you could use MySQL’s CASE expression.
Example:
```
 SELECT field1,field2,..,
 CASE
     WHEN field1 = field2 THEN "Match"
     ELSE "Unmatch"
 END AS Derived_Col
 FROM table;

 -- Result:
 --
 -- field1   field2   Derived_Col
 -- sometxt  sometxt  Match
 -- another  other    Unmatch
```
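To see JOIN and CASE working together, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. That is an assumption made only so the example runs anywhere; the SELECT uses plain SQL that MySQL should accept as well. The table `customer_clean` and its rows are hypothetical and stand for the customer data after cleaning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE location (id_country INT, country TEXT, id_city INT, city TEXT);
    INSERT INTO location VALUES (1,'US',1,'Detroit'), (1,'US',2,'California'),
                                (2,'UK',3,'London'),  (2,'UK',4,'Manchester');
    -- hypothetical table holding the already-cleaned customer locations
    CREATE TABLE customer_clean (id_customer INT, location TEXT);
    INSERT INTO customer_clean VALUES (1,'Detroit'), (4,'London'), (5,'Paris');
""")

# LEFT JOIN keeps customers without a matching city; CASE labels each row
rows = conn.execute("""
    SELECT c.id_customer,
           COALESCE(l.id_city, 0) AS id_city,
           CASE WHEN l.id_city IS NOT NULL THEN 'Match' ELSE 'Unknown' END AS status
    FROM customer_clean AS c
    LEFT JOIN location AS l ON l.city = c.location
    ORDER BY c.id_customer
""").fetchall()

print(rows)  # [(1, 1, 'Match'), (4, 3, 'Match'), (5, 0, 'Unknown')]
```

The LEFT JOIN matters here: an inner JOIN would silently drop customer 5 instead of reporting it as Unknown.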

- JMArnold
- December 15, 2022 at 1:06 pm
- 0 votes
This question packs in quite a few steps, so let’s dive straight in!

First, we should read your (uncleaned) customer view and your location table into data frames:
```
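import pandas as pd

# db_connection is assumed to be an already-open connection to the MySQL
# database (for example a SQLAlchemy engine); it is not created here.
customer_df = pd.read_sql("SELECT * FROM customer", db_connection)
location_df = pd.read_sql("SELECT * FROM location", db_connection)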
```
Now that we have the data in data frames, we can start cleaning the locations in the customer data. There are MANY ways to do so! Your requirements were as follows:

there is punctuation or number or typo

Let’s tackle the first two issues. We can strip out punctuation and digits with a regex: pattern = r"[^a-zA-Z\s]". With that pattern at hand, we first clean the customer location data:
```
pattern = r"[^a-zA-Z\s]"  # anything that is not a letter or whitespace
customer_df["location"] = customer_df["location"].str.replace(pattern, "", regex=True)
```
For typos there is no one-size-fits-all solution. You could use a dictionary of frequent mismatches, or review the database and add the important ones manually. There are also a few libraries that can calculate the "distance" between the intended and the actual word.
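A minimal sketch of both ideas, using only the standard library: a hand-maintained dictionary catches known mistakes, and `difflib.get_close_matches` handles the rest by similarity. The dictionary entries, the `fix_typo` helper, and the 0.8 cutoff are illustrative assumptions, not values derived from your data.

```python
import difflib

KNOWN_FIXES = {"detriot": "Detroit", "ldn": "London"}  # hypothetical entries
CITIES = ["Detroit", "California", "London", "Manchester"]

def fix_typo(name, cutoff=0.8):
    """Return the corrected city name, or None when nothing is close enough."""
    key = name.strip().lower()
    if key in KNOWN_FIXES:  # 1) known mistake, fixed by a plain lookup
        return KNOWN_FIXES[key]
    by_lower = {c.lower(): c for c in CITIES}
    # 2) otherwise take the most similar city, if it reaches the cutoff
    close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None

print(fix_typo("Ldn"))        # London
print(fix_typo("Manchestr"))  # Manchester
print(fix_typo("Paris"))      # None
```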

A good library (subjective opinion, though no affiliation) is FuzzyWuzzy, which builds on the Levenshtein distance and offers several scorers, such as ratio, partial_ratio, token_sort_ratio and token_set_ratio!
```
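from fuzzywuzzy import fuzz, process

# extractOne compares ONE query string against a list of choices,
# so it is applied to each customer location in turn.
cities = location_df["city"].tolist()
levenshtein_matches = pd.DataFrame(
    [process.extractOne(loc, cities, scorer=fuzz.token_set_ratio)
     for loc in customer_df["location"]],
    columns=["match", "ratio"],  # best city name and its 0-100 score
    index=customer_df.index,
)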
```
Note that this is just an example. You may go ahead and read the docs or a good article I found on Medium!

You need to do this twice: once for location_df["city"] and once for location_df["country"]. Use an algorithm of your choice (depending on the data you typically get); as mentioned, from the data you included I cannot conclusively decide what is best for you.

Now you can use a threshold value to decide whether a city / country is similar enough to count as a match! An extreme example: if you have lots of customers from Iran or Iraq, whose names differ by a single letter, you may need to adjust the threshold accordingly 😉
```
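threshold = 80  # example cutoff on the 0-100 score; tune it for your data
city_ids = location_df.set_index("city")["id_city"]  # city name -> id_city
is_match = levenshtein_matches["ratio"] > threshold
customer_df["ratio"] = levenshtein_matches["ratio"]
customer_df.loc[is_match, "id_city"] = levenshtein_matches.loc[is_match, "match"].map(city_ids)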
```
Again, please do this for both the country and city!
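To make the Iran/Iraq point concrete, here is a small sketch with the standard library's `difflib.SequenceMatcher`. It is swapped in for FuzzyWuzzy only so the example runs without extra packages, and its ratio is on a 0 to 1 scale rather than 0 to 100. The two thresholds are arbitrary illustrations.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 (nothing shared) and 1.0 (equal)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = similarity("Iran", "Iraq")
print(score)        # 0.75
print(score > 0.7)  # True: a loose threshold confuses the two countries
print(score > 0.8)  # False: a stricter threshold keeps them apart
```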

Now, lastly, let’s bring together the hard work we’ve done! I now create a new table with three columns: id_customer (which are the ids from the first table), id_city (which is either the city ID or country ID depending on the status) and a status (which will display Match = exact city was found, Country = only the country could be matched and Unknown = no data found -> in that case the default ID will be 0)!

Create the final dataframe: customer_location_df = customer_df[["id_customer"]].copy()

Now set the id_city as described (as mentioned above – you need to do this for country on your own, I sampled the code for city for you):
```
# id_city / id_country are only set where the score beat the threshold,
# so a non-null id already means the match was good enough.
customer_location_df.loc[
    customer_df["id_city"].notnull(), "id_city"
] = customer_df["id_city"]
customer_location_df.loc[
    customer_df["id_city"].isnull() & customer_df["id_country"].notnull(), "id_city"
] = customer_df["id_country"]
customer_location_df.loc[
    customer_df["id_city"].isnull() & customer_df["id_country"].isnull(), "id_city"
] = 0
```
Create the status column and set it:
```
# Same conditions as for the id: city first, then country, otherwise unknown
customer_location_df.loc[
    customer_df["id_city"].notnull(), "status"
] = "Match"
customer_location_df.loc[
    customer_df["id_city"].isnull() & customer_df["id_country"].notnull(), "status"
] = "Country"
customer_location_df.loc[
    customer_df["id_city"].isnull() & customer_df["id_country"].isnull(), "status"
] = "Unknown"
```
Lastly, save the customer_location_df as a new table in the database: customer_location_df.to_sql("customer_location", db_connection, if_exists="replace") (careful not to replace your main table if it’s called customer_location)!
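For reference, the whole flow (clean, match the city, fall back to the country, otherwise 0) can also be sketched without a database or third-party packages. This uses the sample rows from the question and `difflib` in place of FuzzyWuzzy; the function names and the 0.8 cutoff are assumptions for illustration only.

```python
import difflib
import re

CUSTOMERS = [(1, "Detro.it"), (2, "CALiforNIA"), (3, "uk"), (4, "London123"), (5, "Paris")]
# (id_country, country, id_city, city), copied from the question's location table
LOCATIONS = [(1, "US", 1, "Detroit"), (1, "US", 2, "California"),
             (2, "UK", 3, "London"), (2, "UK", 4, "Manchester")]

CITY_IDS = {city.lower(): id_city for _, _, id_city, city in LOCATIONS}
COUNTRY_IDS = {country.lower(): id_country for id_country, country, _, _ in LOCATIONS}

def clean(raw):
    """Drop everything except letters and whitespace, then normalise case."""
    return re.sub(r"[^a-zA-Z\s]", "", raw).strip().lower()

def closest_id(name, ids, cutoff=0.8):
    """Return the id of the most similar key in `ids`, or None if none is close."""
    hit = difflib.get_close_matches(name, list(ids), n=1, cutoff=cutoff)
    return ids[hit[0]] if hit else None

def resolve(raw):
    """Map a raw location to (id, status): city first, then country, else 0."""
    name = clean(raw)
    city_id = closest_id(name, CITY_IDS)
    if city_id is not None:
        return city_id, "Match"
    country_id = closest_id(name, COUNTRY_IDS)
    if country_id is not None:
        return country_id, "Country"
    return 0, "Unknown"

customer_location = [(cid, *resolve(raw)) for cid, raw in CUSTOMERS]
for row in customer_location:
    print(row)
```

On the sample data this reproduces the customer location table from the question, with customer 3 resolved through the country and customer 5 left as Unknown.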