Extract top-level domain in PrestoDB - Amazon web services

broti
October 10, 2022
241 views
0 votes
2 Answers

I have a large dataset of URLs that I query in AWS Athena (using PrestoDB). PrestoDB has built-in functions to extract the host of the URL. I would now like to split the host further into subdomain, domain and top-level domain (TLD).

As far as my understanding goes, this cannot actually be done with regular expressions: "Since there was and remains no algorithmic method of finding the highest level at which a domain may be registered for a particular top-level domain (the policies differ with each registry), the only method is to create a list." (from https://publicsuffix.org/learn/). Packages in Python and R (see here) accordingly use the list of TLDs from publicsuffix.org. To my knowledge, nothing equivalent exists for PrestoDB. Note that the data set is to large to export and process in R or Python.

Theoretically, I could follow the logic of the tldextract R package, which

splits hosts wherever there is a . (example aws.amazon.com becomes a vector of c('aws', 'amazon', 'com')),
then concatenates these components (except the first one) back together in an incremental way (for example, first concatenation would be amazon.com, second and last one would be com
and for each concatenation checks whether there is a match in the TLD list.

Theoretically, I could translate this logic into Presto, but not easy without loops. So is there any existing solution to this?

Answers

Chosen as BEST ANSWER

I came up with my own solution, heavily based on the R package tldextract which relies on the Public Suffix List. I explain the solution in detail here. This assumes you have imported the Public Suffix List into Presto as domains_tld, with a column called tld containing the TLDs.

First some toy data:

CREATE TABLE dat AS
SELECT url, url_extract_host(url) AS url_host FROM
(WITH temp AS (
SELECT * FROM (VALUES
('https://subsub.sub.example.co.uk/page?q=1234#abcd'),
('https://example.api.gov.uk/page?q=1234#abcd'),
('https://example.co.uk/page?q=1234#abcd'),
('https://www.example.com/page?q=1234#abcd'),
('https://example.com/page?q=1234#abcd')) AS tab (url))
SELECT url from temp)

Next, concatenate the resulting parts, starting from the last:

CREATE TABLE dat3 AS
SELECT url_host, part_1, part_2, part_3, part_4, part_5,
CASE
WHEN part_5 IS NOT NULL THEN part_5
WHEN part_5 IS NULL AND part_4 IS NOT NULL THEN part_4
WHEN part_4 IS NULL AND part_3 IS NOT NULL THEN part_3
WHEN part_3 IS NULL AND part_2 IS NOT NULL THEN part_2
END AS last,
CASE
WHEN part_5 IS NOT NULL THEN concat(part_4, '.', part_5)
WHEN part_5 IS NULL AND part_4 IS NOT NULL THEN concat(part_3, '.', part_4)
WHEN part_4 IS NULL AND part_3 IS NOT NULL THEN concat(part_2, '.', part_3)
END AS two_last,
CASE
WHEN part_5 IS NOT NULL THEN concat(part_3, '.', part_4, '.', part_5)
WHEN part_5 IS NULL AND part_4 IS NOT NULL THEN concat(part_2, '.', part_3, '.', part_4)
END AS three_last,
CASE
WHEN part_5 IS NOT NULL THEN concat(part_2, '.', part_3, '.', part_4, '.', part_5)
END AS four_last
FROM dat2

Next, bring this table into “long” format:

CREATE TABLE dat4 AS
SELECT * FROM
(WITH temp AS (
SELECT url_host, array[last, two_last, three_last, four_last] last_combs
FROM dat3)
SELECT url_host, pattern FROM temp
CROSS JOIN UNNEST(last_combs) as t(pattern))
WHERE pattern IS NOT NULL

Now, match the combinations to the list of TLDs:

CREATE TABLE dat5 AS
SELECT url_host, max_by(tld, tld_length) tld FROM
(SELECT url_host, pattern, d.tld, length(d.tld) tld_length
FROM dat4 h
LEFT JOIN domains_tld d ON h.pattern = d.tld
WHERE d.tld IS NOT NULL)
GROUP BY url_host

Finally, we extract the domain out of the host!

SELECT url_host, tld, 
regexp_extract(url_host, concat(‘p{Alnum}+.’, tld)) AS sld 
FROM dat5

(Edit)

- ExceptionNotThrownException
- October 10, 2022 at 5:24 pm
- 0 votes
0
I just checked out the option with regular expression.
Does this not match to Your usecase or am I missunderstood something?
Give this a shot:
```
SELECT regexp_extract('aws.amazon.com', '(w+.w+)$');
SELECT regexp_extract('sub.domain.aws.amazon.com', '(w+.w+)$');
```
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Extract top-level domain in PrestoDB – Amazon web services

Answers