I have a program written in Python 3 that should parse several domain names every day and extract data from them.
The parsed data should serve as input for a search function and for aggregation (statistics and charts), and should save some time for the analyst who uses the program.
Just so you know: I don’t really have the time to study machine learning (which seems to be a pretty good solution here), so I chose to start with regex, which I already use.
I have already searched the regex documentation inside and outside StackOverflow and worked with the debugger on regex101, and I still haven’t found a way to do what I need.
Edit (24/6/2019): I mention machine learning because the reason I need a complex parser is to automate things as much as possible. It would be useful for making automatic choices like blacklisting, whitelisting, etc.
The parser should consider a few things:
- a maximum number of 126 subdomains plus the TLD
- each subdomain must not be longer than 64 characters
- each subdomain can contain only alphanumeric characters and the - character
- each subdomain must not begin or end with the - character
- the TLD must not be longer than 64 characters
- the TLD must not contain only digits
but I want to go a little deeper:
- the first string can (optionally) contain a “usage type” like cpanel., mail., webdisk., autodiscover. and so on… (or maybe a simple www.)
- the TLD can (optionally) contain a particle like .co, .gov, .edu and so on (.co.uk for example)
- the final part of the TLD is not really checked against any list of ccTLDs/gTLDs right now, and I don’t think it will be in the future
What I thought would be useful to solve the problem is one regex group for the optional usage type, one for each subdomain and one for the TLD (the optional particle must be inside the TLD group).
With these rules in mind I came up with a solution:
^(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?\.)?([a-z\d][a-z\d-]{0,62}[a-z\d])?((\.[a-z\d][a-z\d-]{0,62}[a-z\d]){0,124}?)(?P<TLD>(\.co|\.com|\.edu|\.net|\.org|\.gov)?\.(?!\d+)[a-z\d]{1,64})$
The above solution doesn’t return the expected results.
Here are a couple of examples.
A couple of strings to parse:
without.further.ado.lets.travel.the.forest.com
www.without.further.ado.lets.travel.the.forest.gov.it
The groups I expect to find:
- FullMatch: without.further.ado.lets.travel.the.forest.com
  group2: without
  group3: further
  group4: ado
  group5: lets
  group6: travel
  group7: the
  group8: forest
  groupTLD: .com
- FullMatch: www.without.further.ado.lets.travel.the.forest.gov.it
  groupUSAGE: www.
  group2: without
  group3: further
  group4: ado
  group5: lets
  group6: travel
  group7: the
  group8: forest
  groupTLD: .gov.it
The groups I find:
- FullMatch: without.further.ado.lets.travel.the.forest.com
  group2: without
  group3: .further.ado.lets.travel.the.forest
  group4: .forest
  groupTLD: .com
- FullMatch: www.without.further.ado.lets.travel.the.forest.gov.it
  groupUSAGE: www.
  group2: without
  group3: .further.ado.lets.travel.the.forest
  group4: .forest
  groupTLD: .gov.it
  group6: .gov
As you can see from the examples, a couple of particles are found twice, and that is not the behavior I was looking for. Any attempt to edit the formula results in unexpected output.
Any idea about a way to find the expected results?
2 Answers
I don’t know if it is possible to get the output exactly as you asked. I think that a single pattern cannot capture the repeated results in separate groups (group2, group3, ...).
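A quick way to see this limitation with Python’s standard re module: a capturing group placed inside a quantifier keeps only the text of its last repetition. The pattern and the sample string below are simplified for illustration and are not the full expression from the question.

```python
import re

# A capturing group inside a quantifier is overwritten on every repetition,
# so only the last label before the TLD survives in group 1.
pattern = re.compile(r"^(?:([a-z\d-]+)\.)+([a-z]+)$")

m = pattern.match("travel.the.forest.com")
last_label = m.group(1)  # text of the final repetition only
tld = m.group(2)
print(last_label, tld)
```

This is why the question’s pattern reports one group holding the whole run of labels and another holding only the last one, instead of one group per subdomain.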
I found one way to get almost the result you expect using the regex module.
Output:
Here, to avoid capturing the . character in the groups, I put it inside a non-capturing group.
Hope it helps.
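As a rough sketch of the same idea using only the standard re module (a swapped-in technique, not the code this answer originally used): capture the whole run of labels in one named group, keep the dots in non-capturing groups, and split the run afterwards. The shortened usage list, the names LABEL, DOMAIN_RE and parse_domain, and the choice to allow one-character labels are assumptions made for illustration. The third-party regex module can return every repetition of a group directly through captures().

```python
import re

# One label: 1 to 64 chars (the question's limit), no leading/trailing hyphen.
LABEL = r"[a-z\d](?:[a-z\d-]{0,62}[a-z\d])?"

# Hypothetical, shortened usage and particle lists; extend as needed.
DOMAIN_RE = re.compile(
    r"^(?:(?P<USAGE>autodiscover|cpanel|mail|webdisk|www)\.)?"
    r"(?P<LABELS>" + LABEL + r"(?:\." + LABEL + r")*?)"
    r"(?P<TLD>(?:\.(?:co|com|edu|net|org|gov))?\.(?!\d+$)[a-z\d]{1,64})$"
)

def parse_domain(name):
    """Return usage, subdomains and TLD, or None if the name is rejected."""
    m = DOMAIN_RE.match(name.lower())
    if m is None:
        return None
    subdomains = m.group("LABELS").split(".")
    if len(subdomains) > 126:  # limit taken from the question
        return None
    return {
        "usage": m.group("USAGE"),  # captured without the trailing dot
        "subdomains": subdomains,
        "tld": m.group("TLD"),
    }

print(parse_domain("www.without.further.ado.lets.travel.the.forest.gov.it"))
```

Splitting after the match sidesteps the one-group-per-repetition problem, at the cost of returning a list rather than numbered groups.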
This is a simple, well-defined task. There is no fuzziness, no complexity, no guessing, just a series of easy tests to figure out everything on your checklist. I have no idea how “machine learning” would be appropriate or helpful. Even regex is completely unnecessary.
I’ve not implemented everything you want to verify, but it’s not hard to fill in the missing bits.
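A minimal sketch of what such a regex-free check might look like, assuming the rules listed in the question. The names USAGE_TYPES, TLD_PARTICLES, ALLOWED and parse_domain, and the shortened lists, are hypothetical and chosen only for illustration.

```python
# Hypothetical, shortened lists; the real ones would come from the analyst.
USAGE_TYPES = {"autodiscover", "cpanel", "mail", "webdisk", "www"}
TLD_PARTICLES = {"co", "com", "edu", "net", "org", "gov"}
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-")

def parse_domain(name):
    """Split a domain name into usage, subdomains and TLD, or return None."""
    labels = name.lower().split(".")
    # At least one subdomain plus the TLD, at most 126 subdomains plus the TLD.
    if len(labels) < 2 or len(labels) > 127:
        return None
    for label in labels:
        if not 1 <= len(label) <= 64:      # also rejects empty labels ("a..b")
            return None
        if not set(label) <= ALLOWED:      # alphanumerics and "-" only
            return None
        if label[0] == "-" or label[-1] == "-":
            return None
    if labels[-1].isdigit():               # TLD must not be digits only
        return None
    # Optional usage type, taken only if something remains after it.
    usage = labels.pop(0) if len(labels) > 2 and labels[0] in USAGE_TYPES else None
    # TLD, with an optional particle such as "co" in ".co.uk".
    tld = [labels.pop()]
    if len(labels) > 1 and labels[-1] in TLD_PARTICLES:
        tld.insert(0, labels.pop())
    return {"usage": usage, "subdomains": labels, "tld": "." + ".".join(tld)}

print(parse_domain("www.without.further.ado.lets.travel.the.forest.gov.it"))
```

Each rule is a separate, readable test, so adding or relaxing one does not risk breaking the others the way editing a single long pattern can.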