Advanced grouping in domain name regex with Python3 - CPanel

lelemar
June 22, 2019
227 views
0 votes
2 Answers

I have a program written in python3 that should parse several domain names every day and extrapolate data.
Parsed data should serve as input for a search function, for aggregation (statistics and charts) and to save some time to the analyst that uses the program.

Just so you know: I don’t really have the time to study machine learning (which seems to be a pretty good solution here), so I chose to start with regex, that I already use.
I already searched the regex documentation inside and outside StackOverflow and worked on the debugger on regex101 and I still haven’t found a way to do what I need.
Edit (24/6/2019): I mention machine learning because of the reason I need a complex parser, that is automate things as much as possible. It would be useful for making automatic choices like blacklisting, whitelisting, etc.

The parser should consider a few things:

a maximum number of 126 subdomains plus the TLD
each subdomain must not be longer than 64 characters
each subdomain can contain only alphanumeric characters and the - character
each subdomain must not begin or end with the - character
the TLD must not be longer than 64 characters
the TLD must not contain only digits

but I to go a little deeper:

the first string can (optionally) contain a “usage type” like cpanel., mail., webdisk., autodiscover. and so on… (or maybe a symple www.)
the TLD can (optionally) contain a particle like .co, .gov, .edu and so on (.co.uk for example)
the final part of the TLD is not really checked against any list of ccTLD/gTLDs right now and I don’t think it will be in the future

What I thought useful to solve the problem is a regex group for the optional usage type, one for each subdomain and one for the TLD (the optional particle must be inside the TLD group)
With these rules in mind I came up with a solution:

^(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[d]?|wiki|www[d]?.)?([a-zd][a-zd-]{0,62}[a-zd])?((.[a-zd][a-zd-]{0,62}[a-zd]){0,124}?(?P<TLD>(.co|.com|.edu|.net|.org|.gov)?.(?!d+)[a-zd]{1,64})$

The above solution doesn’t return the expected results

I report here a couple of examples:

A couple of strings to parse

without.further.ado.lets.travel.the.forest.com  
www.without.further.ado.lets.travel.the.forest.gov.it

The groups I expect to find

FullMatchwithout.further.ado.lets.travel.the.forest.com
group2without
group3further
group4ado
group5lets
group6travel
group7the
group8forest
groupTLD.com
FullMatchwww.without.further.ado.lets.travel.the.forest.gov.it
groupUSAGEwww.
group2without
group3further
group4ado
group5lets
group6travel
group7the
group8forest
groupTLD.gov.it

The groups I find

FullMatchwithout.further.ado.lets.travel.the.forest.com
group2without
group3.further.ado.lets.travel.the.forest
group4.forest
groupTLD.com
FullMatchwww.without.further.ado.lets.travel.the.forest.gov.it
groupUSAGEwww.
group2without
group3.further.ado.lets.travel.the.forest
group4.forest
groupTLD.gov.it
group6.gov

As you can see from the examples, a couple of particles are found twice and that is not the behavior i sought for, anyway. Any attempt to edit the formula results in unexpeted output.
Any idea about a way to find the expected results?

Answers

- Kamal
- June 24, 2019 at 9:48 am
- 0 votes
0
I don’t know if it is possible to get the output exactly as you asked. I think that with a single pattern it cannot catch results in different groups(group2, group3,..).

I found one way to get almost the result you expect using regex module.
```
match = regex.search(r'^(?:(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[d]?|wiki|www[d]?).)?(?:([a-zd][a-zd-]{0,62}[a-zd]).){0,124}?(?P<TLD>(?:co|com|edu|net|org|gov)?.(?!d+)[a-zd]{1,64})$', 'www.without.further.ado.lets.travel.the.forest.gov.it')
```
Output:
```
match.captures(0)
['www.without.further.ado.lets.travel.the.forest.gov.it']
match.captures[1] or match.captures('USAGE')
['www.']
match.captures(2)
['without', 'further', 'ado', 'lets', 'travel', 'the', 'forest']
match.captures(3) or match.captures('TLD')
['gov.it']
```
Here, to avoid taking . in groups I have added it in non-capturing group like this
```
(?:([a-zd][a-zd-]{0,62}[a-zd]).)
```
Hope it helps.
Login or Signup to reply.

This a simple, well-defined task. There is no fuzzyness, no complexity, no guessing, just a series of easy tests to figure out everything on your checklist. I have no idea how “machine learning” would be appropriate, or helpful. Even regex is completely unnecessary.

I’ve not implemented everything you want to verify, but it’s not hard to fill in the missing bits.

import string

double_tld = ['gov', 'edu', 'co', 'add_others_you_need']

# we'll use this instead of regex to check subdomain validity
valid_sd_characters = string.ascii_letters + string.digits + '-'
valid_trans = str.maketrans('', '', valid_sd_characters)

def is_invalid_sd(sd):
    return sd.translate(valid_trans) != ''

def check_hostname(hostname):
    subdomains = hostname.split('.')

    # each subdomain can contain only alphanumeric characters and
    # the - character
    invalid_parts = list(filter(is_invalid_sd, subdomains))
    # TODO react if there are any invalid parts

    # "the TLD can (optionally) contain a particle like
    # .co, .gov, .edu and so on (.co.uk for example)"
    if subdomains[-2] in double_tld:
        subdomains[-2] += '.' + subdomains[-1]
        subdomains = subdomains[:-1]

    # "a maximum number of 126 subdomains plus the TLD"
    # TODO check list length of subdomains

    # "each subdomain must not begin or end with the - character"
    # "the TLD must not be longer than 64 characters"
    # "the TLD must not contain only digits"
    # TODO write loop, check first and last characters, length, isnumeric

    # TODO return something

Please signup or login to give your own answer.

Click here to cancel reply.

Advanced grouping in domain name regex with Python3 – CPanel

Answers