skip to Main Content

I’m new to Python and stuck with simple task which is now looks not simple at all.

So basically the idea is to provide an obfuscated MySQL dump to Dev team.

The dump is full of INSERT statements with client emails and the idea is to replace usernames with some random hashes (random is important because of UNIQ indexes) and to replace domain name as well.

I tried to use re.sub, it does the job but only if there is one regex match per line. Unfortunately, INSERT statements have lots of emails in one line, so I’m wondering am I missing something?

Basically, here is a data sample:

INSERT INTO `users` VALUES ('john','[email protected]',''),('kate','[email protected]',''),
INSERT INTO `users` VALUES ('peter','[email protected]',''),

Current result:

INSERT INTO `users` VALUES ('john','-**5196869426929071700**@example.com',''),('kate','-**5196869426929071700**@example.com',''),
INSERT INTO `users` VALUES ('peter','[email protected]',''),

The desired output:

INSERT INTO `users` VALUES ('john','-**5196869426929071700**@example.com',''),('kate','-**7140769694376981046**@example.com',''),
INSERT INTO `users` VALUES ('peter','[email protected]',''),

My code:

import re

with open("dump.sql", "r") as file:
    lines = file.readlines()

def repl(match):
    output = str(hash(line)) + "@example.com"
    return output

new_lines = []
for line in lines:
    new_lines.append(re.sub('([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,})', repl, line))

with open("dump-processed.sql", "w") as file:
    file.writelines(new_lines)

Any thoughts on this issue are really appreciated, thanks in advance!

2

Answers


  1. What happens between creating the dump and someone seeing the obfuscated emails?

    Here is a thought:

    1. Take the dump
    2. Load the dump into a separate table(s) (or database, or even server)
    3. UPDATE table SET email = MD5(email);
    4. Move that table(s) to its ultimate location.

    What are the goals?

    If the email column needs to be consistent between tables and for the same email to show up as the same hash in different rows, this achieves such.

    For example, ‘[email protected]’ is consistently and uniquely turned into ‘1f9d9a9efc2f523b2f09629444632b5c’.

    If you need to look at the name and domain separately, my approach needs some work. But so does your "example.com";.

    Login or Signup to reply.
  2. You could use mysqldump --tab to dump CSV data files instead of SQL. Then do your obfuscation on the CSV data.

    Importing these files is a little more complex. You’d recreate the tables, then use mysqlimport one table at a time.

    You need to write some script to loop over the tables, but the advantage is that the import would be much faster than importing by running SQL files.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search