
I have a program written in Python 3 that should parse several domain names every day and extract data from them.
The parsed data should serve as input for a search function and for aggregation (statistics and charts), and should save some time for the analyst who uses the program.

Just so you know: I don’t really have the time to study machine learning (which seems to be a pretty good solution here), so I chose to start with regex, which I already use.
I have already searched the regex documentation inside and outside StackOverflow and worked with the debugger on regex101, and I still haven’t found a way to do what I need.
Edit (24/6/2019): I mention machine learning because the reason I need a complex parser is to automate things as much as possible. It would be useful for making automatic choices like blacklisting, whitelisting, etc.

The parser should consider a few things:

  • a maximum number of 126 subdomains plus the TLD
  • each subdomain must not be longer than 64 characters
  • each subdomain can contain only alphanumeric characters and the - character
  • each subdomain must not begin or end with the - character
  • the TLD must not be longer than 64 characters
  • the TLD must not contain only digits

but I want to go a little deeper:

  • the first string can (optionally) contain a “usage type” like cpanel., mail., webdisk., autodiscover. and so on (or maybe a simple www.)
  • the TLD can (optionally) contain a particle like .co, .gov, .edu and so on (.co.uk for example)
  • the final part of the TLD is not really checked against any list of ccTLD/gTLDs right now and I don’t think it will be in the future

What I thought would be useful to solve the problem is one regex group for the optional usage type, one for each subdomain and one for the TLD (the optional particle must be inside the TLD group).
With these rules in mind I came up with a solution:

^(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?\.)?([a-z\d][a-z\d-]{0,62}[a-z\d])?((\.[a-z\d][a-z\d-]{0,62}[a-z\d]){0,124}?)(?P<TLD>(\.co|\.com|\.edu|\.net|\.org|\.gov)?\.(?!\d+)[a-z\d]{1,64})$

The above solution doesn’t return the expected results. Here are a couple of examples:

A couple of strings to parse

without.further.ado.lets.travel.the.forest.com  
www.without.further.ado.lets.travel.the.forest.gov.it  

The groups I expect to find

  • Full match: without.further.ado.lets.travel.the.forest.com
    group 2: without
    group 3: further
    group 4: ado
    group 5: lets
    group 6: travel
    group 7: the
    group 8: forest
    group TLD: .com
  • Full match: www.without.further.ado.lets.travel.the.forest.gov.it
    group USAGE: www.
    group 2: without
    group 3: further
    group 4: ado
    group 5: lets
    group 6: travel
    group 7: the
    group 8: forest
    group TLD: .gov.it

The groups I find

  • Full match: without.further.ado.lets.travel.the.forest.com
    group 2: without
    group 3: .further.ado.lets.travel.the.forest
    group 4: .forest
    group TLD: .com
  • Full match: www.without.further.ado.lets.travel.the.forest.gov.it
    group USAGE: www.
    group 2: without
    group 3: .further.ado.lets.travel.the.forest
    group 4: .forest
    group TLD: .gov.it
    group 6: .gov

As you can see from the examples, a couple of particles are found twice, which is not the behavior I am looking for. Any attempt to edit the pattern results in unexpected output.
Any idea how to get the expected results?
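
The duplicated fragments follow from how Python’s `re` module treats a quantified capture group: the group gets one number, and after the match it holds only the text of its last repetition. A minimal sketch of that behavior, using a deliberately simplified pattern (an assumption for illustration, not the full pattern above):

```python
import re

# Simplified stand-in for the pattern above: a first label, then an outer
# group wrapping a repeated inner group of ".label" pieces.
pattern = re.compile(r"^([a-z\d-]+)((\.[a-z\d-]+)*)$")

m = pattern.match("without.further.ado")

first = m.group(1)  # the first label
outer = m.group(2)  # everything the repetition consumed
last = m.group(3)   # only the LAST repetition of the inner group

print(first, outer, last)
```

The inner group matched twice, yet only its final repetition is kept, which is the same effect as `.forest` showing up on its own in the results above.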

2 Answers


  1. I don’t know if it is possible to get the output exactly as you asked. I think a single pattern cannot put each repetition into its own numbered group (group2, group3, ...).

    I found a way to get almost the result you expect using the third-party regex module.

    import regex  # third-party module (not the standard library re)

    match = regex.search(r'^(?:(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?)\.)?(?:([a-z\d][a-z\d-]{0,62}[a-z\d])\.){0,124}?(?P<TLD>(?:co|com|edu|net|org|gov)?\.(?!\d+)[a-z\d]{1,64})$', 'www.without.further.ado.lets.travel.the.forest.gov.it')
    

    Output:

    match.captures(0)
    ['www.without.further.ado.lets.travel.the.forest.gov.it']
    match.captures(1) or match.captures('USAGE')
    ['www']
    match.captures(2)
    ['without', 'further', 'ado', 'lets', 'travel', 'the', 'forest']
    match.captures(3) or match.captures('TLD')
    ['gov.it']
    

    Here, to keep the . out of the captured groups, I wrapped each capture group and its trailing . in a non-capturing group, like this:

    (?:([a-z\d][a-z\d-]{0,62}[a-z\d])\.)
    

    Hope it helps.
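
If installing a third-party module is not an option, a similar split seems achievable with the standard `re` module by letting the pattern peel off only the usage type and the TLD, then splitting the middle part on dots. This is a rough sketch under assumptions: `split_hostname`, `HOSTNAME`, `LABEL` and the two word lists are hypothetical names chosen for illustration, the lookahead is written as `(?!\d+$)` to mean "not digits only", and single-character labels are accepted.

```python
import re

# Assumed, illustrative lists taken from the question; extend as needed.
USAGE = (r"autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|"
         r"webhost|webmail\d?|wiki|www\d?")
PARTICLE = r"co|com|edu|net|org|gov"

# Optional usage type, a lazily matched middle part, then the TLD
# (optional particle plus a final part that is not digits only).
HOSTNAME = re.compile(
    rf"(?:(?P<USAGE>{USAGE})\.)?"
    rf"(?P<LABELS>.+?)"
    rf"\.(?P<TLD>(?:(?:{PARTICLE})\.)?(?!\d+$)[a-z\d]{{1,64}})"
)

# One label: 1 to 64 characters, no leading or trailing hyphen.
LABEL = re.compile(r"[a-z\d](?:[a-z\d-]{0,62}[a-z\d])?")


def split_hostname(hostname):
    """Return (usage, labels, tld), or None if the hostname is rejected."""
    m = HOSTNAME.fullmatch(hostname.lower())
    if m is None:
        return None
    labels = m.group("LABELS").split(".")
    if len(labels) > 126 or not all(LABEL.fullmatch(x) for x in labels):
        return None
    return m.group("USAGE"), labels, m.group("TLD")


result = split_hostname(
    "www.without.further.ado.lets.travel.the.forest.gov.it")
print(result)
```

Each subdomain ends up as a list element rather than a numbered group, which sidesteps the one-value-per-group limit of `re`.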

  2. This is a simple, well-defined task. There is no fuzziness, no complexity, no guessing, just a series of easy tests to figure out everything on your checklist. I have no idea how “machine learning” would be appropriate, or helpful. Even regex is completely unnecessary.

    I’ve not implemented everything you want to verify, but it’s not hard to fill in the missing bits.

    import string
    
    double_tld = ['gov', 'edu', 'co', 'add_others_you_need']
    
    # we'll use this instead of regex to check subdomain validity
    valid_sd_characters = string.ascii_letters + string.digits + '-'
    valid_trans = str.maketrans('', '', valid_sd_characters)
    
    def is_invalid_sd(sd):
        return sd.translate(valid_trans) != ''
    
    def check_hostname(hostname):
        subdomains = hostname.split('.')
    
        # each subdomain can contain only alphanumeric characters and
        # the - character
        invalid_parts = list(filter(is_invalid_sd, subdomains))
        # TODO react if there are any invalid parts
    
        # "the TLD can (optionally) contain a particle like
        # .co, .gov, .edu and so on (.co.uk for example)"
        if len(subdomains) >= 2 and subdomains[-2] in double_tld:
            subdomains[-2] += '.' + subdomains[-1]
            subdomains = subdomains[:-1]
    
        # "a maximum number of 126 subdomains plus the TLD"
        # TODO check list length of subdomains
    
        # "each subdomain must not begin or end with the - character"
        # "the TLD must not be longer than 64 characters"
        # "the TLD must not contain only digits"
        # TODO write loop, check first and last characters, length, isnumeric
    
        # TODO return something
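
One possible way to fill in the TODOs, keeping the same regex-free approach. This is only a sketch under assumptions: `parse_hostname`, `is_valid_label`, `USAGE_TYPES` and the returned dict layout are hypothetical choices, the word lists are placeholders, and invalid input is reported by returning `None`.

```python
import string

# Assumed, illustrative lists; extend them as needed.
DOUBLE_TLD = {"gov", "edu", "co"}
USAGE_TYPES = {"www", "mail", "cpanel", "webdisk", "autodiscover"}

VALID_CHARS = set(string.ascii_letters + string.digits + "-")


def is_valid_label(label):
    """1 to 64 allowed characters, no leading or trailing hyphen."""
    return (
        0 < len(label) <= 64
        and set(label) <= VALID_CHARS
        and not label.startswith("-")
        and not label.endswith("-")
    )


def parse_hostname(hostname):
    """Return a dict with usage, subdomains and tld, or None if invalid."""
    labels = hostname.split(".")
    if len(labels) < 2 or not all(is_valid_label(x) for x in labels):
        return None

    # Fold an optional particle such as "co" or "gov" into the TLD.
    tld_parts = 2 if len(labels) >= 3 and labels[-2] in DOUBLE_TLD else 1
    tld = ".".join(labels[-tld_parts:])
    labels = labels[:-tld_parts]

    # The TLD must fit in 64 characters and its final part
    # must not consist of digits only.
    if len(tld) > 64 or tld.rsplit(".", 1)[-1].isdigit():
        return None

    # Peel off an optional usage type, but keep at least one subdomain.
    usage = None
    if len(labels) > 1 and labels[0] in USAGE_TYPES:
        usage, labels = labels[0], labels[1:]

    if len(labels) > 126:
        return None
    return {"usage": usage, "subdomains": labels, "tld": tld}


parsed = parse_hostname(
    "www.without.further.ado.lets.travel.the.forest.gov.it")
print(parsed)
```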
    