I have a large dataset of URLs that I query in AWS Athena (using PrestoDB). PrestoDB has built-in functions to extract the host of the URL. I would now like to split the host further into subdomain, domain and top-level domain (TLD).
As far as my understanding goes, this cannot actually be done with regular expressions: "Since there was and remains no algorithmic method of finding the highest level at which a domain may be registered for a particular top-level domain (the policies differ with each registry), the only method is to create a list." (from https://publicsuffix.org/learn/). Packages in Python and R (see here) accordingly use the list of TLDs from publicsuffix.org. To my knowledge, nothing equivalent exists for PrestoDB. Note that the data set is to large to export and process in R or Python.
Theoretically, I could follow the logic of the tldextract
R package, which
- splits hosts wherever there is a
.
(exampleaws.amazon.com
becomes a vector ofc('aws', 'amazon', 'com')
), - then concatenates these components (except the first one) back together in an incremental way (for example, first concatenation would be
amazon.com
, second and last one would becom
- and for each concatenation checks whether there is a match in the TLD list.
Theoretically, I could translate this logic into Presto, but not easy without loops. So is there any existing solution to this?
2
Answers
I came up with my own solution, heavily based on the R package
tldextract
which relies on the Public Suffix List. I explain the solution in detail here. This assumes you have imported the Public Suffix List into Presto asdomains_tld
, with a column calledtld
containing the TLDs.First some toy data:
Next, concatenate the resulting parts, starting from the last:
Next, bring this table into “long” format:
Now, match the combinations to the list of TLDs:
Finally, we extract the domain out of the host!
I just checked out the option with regular expression.
Does this not match to Your usecase or am I missunderstood something?
Give this a shot: