I have limited access to the MySQL database. I can only see a view called customer, which contains id_customer, name, and location:
| id_customer | name | location |
|---|---|---|
| 1 | Andy | Detro.it |
| 2 | Ben | CALiforNIA |
| 3 | Mark | uk |
| 4 | Niels | London123 |
| 5 | Pierre | Paris |
There is also a table called location, which contains the list of cities and countries for customer locations:
| id_country | country | id_city | city |
|---|---|---|---|
| 1 | US | 1 | Detroit |
| 1 | US | 2 | California |
| 2 | UK | 3 | London |
| 2 | UK | 4 | Manchester |
I want to clean the customer data automatically whenever new data arrives in the database. That is, if the raw location contains punctuation, numbers, or a typo, it should be cleaned automatically. The cleaned location should then be looked up in the location table to find its id_city. If no city is similar or matches, it should be looked up as a country to find its id_country. If neither matches, the id should be 0. The result should become a new table called customer location:
| id_customer | id_city | status |
|---|---|---|
| 1 | 1 | Match |
| 2 | 2 | Match |
| 3 | 2 | Country |
| 4 | 3 | Match |
| 5 | 0 | Unknown |
The status is a label: if the location matched a city it is Match, if it matched a country it is Country, and if nothing similar was found (so the id is 0) it is Unknown. The location can be either a city or a country, so the status tells you which of the two the id refers to.
Can someone suggest how I should approach this project? I am trying to do it with Python in a Jupyter notebook, but will that be effective for this case? I am really new to this, so sorry if I haven’t given enough information, and thanks in advance.
2 Answers
I’m not sure I understood the question correctly.
[…] I mean if in the raw data there is punctuation or number or typo, it will automatically clean […]
You need some sort of validation method here. You cannot achieve that directly in the database; you need to handle it in your own logic before the rows are inserted.
In these cases, the best solution is to prepare a picklist (multiple choice) from which end users can choose the right values.
A free-text input will always be error-prone.
If multiple choice is not applicable in your case, then you need to put a list of validation rules in place, and you need to think about how to prevent every possible issue.
Example, in your case:
You could use regex to clean the input. For example, to extract only the letters:
[a-zA-Z]+
To normalise the letter case of the input, you could use
str.title()
which capitalises the first letter of each word and converts all other letters to lowercase.
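A minimal sketch of these two steps in plain Python, using the sample values from the question. The helper name `clean_location` is made up for illustration, and joining the extracted letter runs without spaces is an assumption that would merge multi-word names such as "New York":

```python
import re

def clean_location(raw: str) -> str:
    """Keep only ASCII letters, then normalise the letter case."""
    # findall returns every run of letters, e.g. "Detro.it" -> ["Detro", "it"].
    letter_runs = re.findall(r"[a-zA-Z]+", raw)
    # Assumption: runs are glued together, so spaces inside names are lost.
    joined = "".join(letter_runs)
    # title() upper-cases the first letter and lower-cases the rest.
    return joined.title()

samples = ["Detro.it", "CALiforNIA", "uk", "London123", "Paris"]
cleaned = [clean_location(s) for s in samples]
print(cleaned)
```

Note that "uk" becomes "Uk", not "UK", so any later comparison against the location table should ignore case.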
The final table can be obtained with a MySQL query.
You need to JOIN the tables. (Here the only common field is the name of the city, which is not ideal for the ON clause; an id would be better.)
To derive the ‘Status’ column, you could use MySQL’s CASE expression.
Example:
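A minimal sketch of such a JOIN plus CASE query. To keep it runnable without a MySQL server, it is executed through Python's built-in `sqlite3` module as a stand-in; the table name `customer_clean` (the customer data after cleaning) is an assumption, and the sample rows come from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.executescript("""
CREATE TABLE customer_clean (id_customer INTEGER, location TEXT);
CREATE TABLE location (id_country INTEGER, country TEXT,
                       id_city INTEGER, city TEXT);
INSERT INTO customer_clean VALUES
  (1, 'Detroit'), (2, 'California'), (3, 'Uk'), (4, 'London'), (5, 'Paris');
INSERT INTO location VALUES
  (1, 'US', 1, 'Detroit'), (1, 'US', 2, 'California'),
  (2, 'UK', 3, 'London'), (2, 'UK', 4, 'Manchester');
""")

query = """
SELECT c.id_customer,
       -- city id if a city matched, else country id, else 0
       COALESCE(ci.id_city, co.id_country, 0) AS id_city,
       CASE
           WHEN ci.id_city IS NOT NULL THEN 'Match'
           WHEN co.id_country IS NOT NULL THEN 'Country'
           ELSE 'Unknown'
       END AS status
FROM customer_clean AS c
LEFT JOIN location AS ci
       ON LOWER(ci.city) = LOWER(c.location)
LEFT JOIN (SELECT DISTINCT id_country, country FROM location) AS co
       ON LOWER(co.country) = LOWER(c.location)
ORDER BY c.id_customer
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The two LEFT JOINs keep every customer even when nothing matches, and the CASE expression records which of the two lookups succeeded.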
This is a multi-part question with lots of steps needed to achieve what you want. So let’s dive straight in!
First, we should read your (uncleaned) customer database and your location database into data frames:
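A minimal sketch of this step with `pandas.read_sql`. As an assumption for the sake of a runnable example, an in-memory SQLite database seeded with the question's sample rows stands in for the real MySQL connection; for MySQL you would typically pass a SQLAlchemy engine as the connection instead:

```python
import sqlite3

import pandas as pd

# Assumption: in-memory SQLite stand-in for the real MySQL database.
db_connection = sqlite3.connect(":memory:")
db_connection.executescript("""
CREATE TABLE customer (id_customer INTEGER, name TEXT, location TEXT);
CREATE TABLE location (id_country INTEGER, country TEXT,
                       id_city INTEGER, city TEXT);
INSERT INTO customer VALUES
  (1, 'Andy', 'Detro.it'), (2, 'Ben', 'CALiforNIA'), (3, 'Mark', 'uk'),
  (4, 'Niels', 'London123'), (5, 'Pierre', 'Paris');
INSERT INTO location VALUES
  (1, 'US', 1, 'Detroit'), (1, 'US', 2, 'California'),
  (2, 'UK', 3, 'London'), (2, 'UK', 4, 'Manchester');
""")

# Each query result is loaded into its own data frame.
customer_df = pd.read_sql(
    "SELECT id_customer, name, location FROM customer", db_connection
)
location_df = pd.read_sql(
    "SELECT id_country, country, id_city, city FROM location", db_connection
)
print(customer_df)
print(location_df)
```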
Now that we have the data stored in proper frames, we can start to clean the locations in your customer database. There are MANY ways to do so! Your requirements were to clean out punctuation, numbers, and typos.
Now let’s tackle the first two issues. We can do this using RegEx to strip out punctuation and numbers:
pattern = r"[^a-zA-Z\s]"
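A minimal sketch of how such a pattern might be applied column-wise in pandas, assuming the whitespace class `\s` is what the pattern is meant to preserve. The sample values are taken from the question; the trailing `strip` and `title` calls are an added normalisation step, not a requirement:

```python
import pandas as pd

# Everything that is NOT a letter or whitespace gets removed.
pattern = r"[^a-zA-Z\s]"

locations = pd.Series(["Detro.it", "CALiforNIA", "uk", "London123", "Paris"])
cleaned = (
    locations.str.replace(pattern, "", regex=True)  # drop punctuation/digits
    .str.strip()                                    # trim stray whitespace
    .str.title()                                    # "CALiforNIA" -> "California"
)
print(cleaned.tolist())
```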
Once the customer location data has been cleaned with that pattern, typos are the remaining issue, and for those there is no one-size-fits-all solution. You could use a dictionary of frequent mismatches. Or review the database and add the important ones manually. There are also a few libraries which can calculate the "distance" between the intended and the actual word.
A good library (subjective opinion, though no affiliation) is FuzzyWuzzy, as it offers several similarity scorers built on the Levenshtein distance!
Note that this is just an example. You may go ahead and read the docs or a good article I found on Medium!
You do need to do this twice, for both `location_df["city"]` and `location_df["country"]`. Use an algorithm of your choice (depending on the data you typically get), but as mentioned, with the data you included I cannot conclusively decide for you what’s best to use.

Now you can use a threshold value to determine whether a city or country is similar enough to be considered a match! A radical example:

But if you get lots of customers from Iran or Iraq, whose names differ by a single letter, you may need to adjust the values accordingly 😉
Again, please do this for both the country and city!
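A minimal sketch of threshold-based matching. Note the swap: instead of FuzzyWuzzy it uses `difflib` from the Python standard library, so it runs without extra installs; the helper name `best_match` and the default cutoff of 0.8 are assumptions for illustration, not recommended values:

```python
from difflib import get_close_matches
from typing import Optional, Sequence

def best_match(cleaned: str, candidates: Sequence[str],
               cutoff: float = 0.8) -> Optional[str]:
    """Return the most similar candidate, or None if none reaches the cutoff."""
    # Compare in lower case, but hand back the original spelling.
    by_lower = {c.lower(): c for c in candidates}
    hits = get_close_matches(cleaned.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None

cities = ["Detroit", "California", "London", "Manchester"]
countries = ["US", "UK"]

print(best_match("Londn", cities))     # typo in a city name
print(best_match("Uk", countries))     # case differs only
print(best_match("Paris", cities))     # nothing similar enough

# The threshold matters for near-identical names:
print(best_match("Iraq", ["Iran"], cutoff=0.8))
print(best_match("Iraq", ["Iran"], cutoff=0.7))
```

With the stricter cutoff "Iraq" is not mapped onto "Iran", while the more lenient one accepts it, which is why the threshold has to be tuned to the data you actually receive.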
Now, lastly, let’s bring together the hard work we’ve done! I now create a new table with three columns: `id_customer` (which are the ids from the first table), `id_city` (which is either the city ID or country ID depending on the status) and a `status` (which will display `Match` = exact city was found, `Country` = only the country could be matched and `Unknown` = no data found -> in that case the default ID will be 0)!

Create the final dataframe:
customer_location_df = customer_df[["id_customer"]].copy()
Now set the id_city as described (as mentioned above – you need to do this for country on your own, I sampled the code for city for you):
Create the `status` column and set it:

Lastly, save the `customer_location_df` as a new table in the database:

customer_location_df.to_sql("customer_location", db_connection, if_exists="replace")

(careful not to replace your main table if it’s called `customer_location`)!
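Putting the last steps together, here is a minimal sketch of how the `id_city` and `status` columns might be filled on the sample data from the question. It assumes the locations have already been cleaned, and for brevity it uses exact, case-insensitive lookups in place of fuzzy matching; the intermediate names `city_hit` and `country_hit` are made up for illustration:

```python
import pandas as pd

customer_df = pd.DataFrame({
    "id_customer": [1, 2, 3, 4, 5],
    # Assumption: locations are already cleaned.
    "location": ["Detroit", "California", "Uk", "London", "Paris"],
})
location_df = pd.DataFrame({
    "id_country": [1, 1, 2, 2],
    "country": ["US", "US", "UK", "UK"],
    "id_city": [1, 2, 3, 4],
    "city": ["Detroit", "California", "London", "Manchester"],
})

# Lookup dictionaries keyed by lower-cased name.
city_ids = dict(zip(location_df["city"].str.lower(), location_df["id_city"]))
country_ids = dict(
    zip(location_df["country"].str.lower(), location_df["id_country"])
)

key = customer_df["location"].str.lower()
city_hit = key.map(city_ids)        # NaN where no city matched
country_hit = key.map(country_ids)  # NaN where no country matched

customer_location_df = customer_df[["id_customer"]].copy()
# City id first, then country id, then the default 0.
customer_location_df["id_city"] = (
    city_hit.fillna(country_hit).fillna(0).astype(int)
)
# Later assignments win, so a city match overrides a country match.
customer_location_df["status"] = "Unknown"
customer_location_df.loc[country_hit.notna(), "status"] = "Country"
customer_location_df.loc[city_hit.notna(), "status"] = "Match"
print(customer_location_df)
```

On the sample data this reproduces the table the question asks for, including the `0` / `Unknown` row for Paris.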