skip to Main Content

I have a large dataset of URLs that I query in AWS Athena (using PrestoDB). PrestoDB has built-in functions to extract the host of the URL. I would now like to split the host further into subdomain, domain and top-level domain (TLD).

As far as my understanding goes, this cannot actually be done with regular expressions: "Since there was and remains no algorithmic method of finding the highest level at which a domain may be registered for a particular top-level domain (the policies differ with each registry), the only method is to create a list." (from https://publicsuffix.org/learn/). Packages in Python and R (see here) accordingly use the list of TLDs from publicsuffix.org. To my knowledge, nothing equivalent exists for PrestoDB. Note that the data set is to large to export and process in R or Python.

Theoretically, I could follow the logic of the tldextract R package, which

  • splits hosts wherever there is a . (example aws.amazon.com becomes a vector of c('aws', 'amazon', 'com')),
  • then concatenates these components (except the first one) back together in an incremental way (for example, first concatenation would be amazon.com, second and last one would be com
  • and for each concatenation checks whether there is a match in the TLD list.

Theoretically, I could translate this logic into Presto, but not easy without loops. So is there any existing solution to this?

2

Answers


  1. Chosen as BEST ANSWER

    I came up with my own solution, heavily based on the R package tldextract which relies on the Public Suffix List. I explain the solution in detail here. This assumes you have imported the Public Suffix List into Presto as domains_tld, with a column called tld containing the TLDs.

    First some toy data:

    CREATE TABLE dat AS
    SELECT url, url_extract_host(url) AS url_host FROM
    (WITH temp AS (
    SELECT * FROM (VALUES
    ('https://subsub.sub.example.co.uk/page?q=1234#abcd'),
    ('https://example.api.gov.uk/page?q=1234#abcd'),
    ('https://example.co.uk/page?q=1234#abcd'),
    ('https://www.example.com/page?q=1234#abcd'),
    ('https://example.com/page?q=1234#abcd')) AS tab (url))
    SELECT url from temp)
    

    Next, concatenate the resulting parts, starting from the last:

    CREATE TABLE dat3 AS
    SELECT url_host, part_1, part_2, part_3, part_4, part_5,
    CASE
    WHEN part_5 IS NOT NULL THEN part_5
    WHEN part_5 IS NULL AND part_4 IS NOT NULL THEN part_4
    WHEN part_4 IS NULL AND part_3 IS NOT NULL THEN part_3
    WHEN part_3 IS NULL AND part_2 IS NOT NULL THEN part_2
    END AS last,
    CASE
    WHEN part_5 IS NOT NULL THEN concat(part_4, '.', part_5)
    WHEN part_5 IS NULL AND part_4 IS NOT NULL THEN concat(part_3, '.', part_4)
    WHEN part_4 IS NULL AND part_3 IS NOT NULL THEN concat(part_2, '.', part_3)
    END AS two_last,
    CASE
    WHEN part_5 IS NOT NULL THEN concat(part_3, '.', part_4, '.', part_5)
    WHEN part_5 IS NULL AND part_4 IS NOT NULL THEN concat(part_2, '.', part_3, '.', part_4)
    END AS three_last,
    CASE
    WHEN part_5 IS NOT NULL THEN concat(part_2, '.', part_3, '.', part_4, '.', part_5)
    END AS four_last
    FROM dat2
    

    Next, bring this table into “long” format:

    CREATE TABLE dat4 AS
    SELECT * FROM
    (WITH temp AS (
    SELECT url_host, array[last, two_last, three_last, four_last] last_combs
    FROM dat3)
    SELECT url_host, pattern FROM temp
    CROSS JOIN UNNEST(last_combs) as t(pattern))
    WHERE pattern IS NOT NULL
    

    Now, match the combinations to the list of TLDs:

    CREATE TABLE dat5 AS
    SELECT url_host, max_by(tld, tld_length) tld FROM
    (SELECT url_host, pattern, d.tld, length(d.tld) tld_length
    FROM dat4 h
    LEFT JOIN domains_tld d ON h.pattern = d.tld
    WHERE d.tld IS NOT NULL)
    GROUP BY url_host
    

    Finally, we extract the domain out of the host!

    SELECT url_host, tld, 
    regexp_extract(url_host, concat(‘p{Alnum}+.’, tld)) AS sld 
    FROM dat5
    

  2. I just checked out the option with regular expression.
    Does this not match to Your usecase or am I missunderstood something?
    Give this a shot:

    SELECT regexp_extract('aws.amazon.com', '(w+.w+)$');
    SELECT regexp_extract('sub.domain.aws.amazon.com', '(w+.w+)$');
    

    Select regExp top level domain

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search