skip to Main Content

Given the following json:

Full file here: https://pastebin.com/Hzt9bq2a

{
    "name": "Visma Public",
    "domains": [
      "accountsettings.connect.identity.stagaws.visma.com",
      "admin.stage.vismaonline.com",
      "api.home.stag.visma.com",
      "api.workbox.dk",
      "app.workbox.dk",
      "app.workbox.co.uk",
      "authz.workbox.dk",
      "connect.identity.stagaws.visma.com",
      "eaccounting.stage.vismaonline.com",
      "eaccountingprinting.stage.vismaonline.com",
      "http://myservices-api.stage.vismaonline.com/",
      "identity.stage.vismaonline.com",
      "myservices.stage.vismaonline.com"
    ]
}

How can I transform the data to the below. Which is, to identify the domains in the format of site.SLD.TLD present and then remove the duplication of them. (Not including the subdomains, protocols or paths as illustrated below.)

{
    "name": "Visma Public",
    "domains": [
      "workbox.co.uk",
      "workbox.dk",
      "visma.com",
      "vismaonline.com"
    ]
}

I would like to do so in jq as that is what I’ve used to wrangled the data into this format so far, but at this stage any solution that I can run on Debian (I’m using bash) without any extraneous tooling ideally would be fine.

I’m aware that regex can be used within jq so I assume the best way is to regex out the domain and then pipe to unique however I’m unable to get anything working so far I’m currently trying this version which seems to me to need only the text transformation stage adding in somehow either during the jq process or with a run over with something like awk after the event perhaps:

jq '[.[] | {name: .name, domain: [.domains[]] | unique}]' testfile.json

This appears to be useful: https://github.com/stedolan/jq/issues/537

One solution was offered which does a regex match to extract the last two strings separated by . and call the unique function on that & works up to a point but doesn’t cover site.SLD.TLD that has 2 parts. Like google.co.uk would return only co.uk with this jq for example:

jq '.domains |= (map(capture("(?<x>[[:alpha:]]+).(?<z>[[:alpha:]]+)(.?)$") | join(".")) | unique)'

4

Answers


  1. Judging from your example, you don’t actually want top-level domains (just one component, e.g. ".com"), and you probably don’t really want second-level domains (last two components) either, because some domain registries don’t operate at the TLD level. Given www.foo.com.br, you presumably want to find out about foo.com.br, not com.br.

    To do that, you need to consult the Public Suffix List. The file format isn’t too complicated, but it has support for wildcards and exceptions. I dare say that jq isn’t the ideal language to use here — pick one that has a URL-parsing module (for extracting hostnames) and an existing Public Suffix List module (for extracting the domain parts from those hostnames).

    Login or Signup to reply.
  2. May I offer you achieving the same query with jtc: the same could be achieved in other languages (and of course in jq) – the query is mostly how to come up with the regex to satisfy your ask:

    bash $ <file.json jtc -w'<domains>l:>((?:[a-z0-9]+.)?[a-z0-9]+.[a-z0-9]+)[^.]*$<R:' -u'{{$1}}' /
                          -ppw'<domains>l:><q:' -w'[domains]:<[]>j:' -w'<name>l:'
    {
       "domains": [
          "stagaws.visma.com",
          "stage.vismaonline.com",
          "stag.visma.com",
          "api.workbox.dk",
          "app.workbox.dk",
          "workbox.co.uk",
          "authz.workbox.dk"
       ],
       "name": "Visma Public"
    }
    bash $ 
    

    Note: it does extract only DOMAIN.TLD, as per your ask. If you like to extract DOMAIN.SLD.TLD, then the task becomes a bit less trivial.

    Update:

    Modified solution as per the comment: extract domain.sld.tld where 3 or more levels and domain.tld where there’s only 2

    PS. I’m the creator of the jtc – JSON processing utility. This disclaimer is SO requirement.

    Login or Signup to reply.
  3. A programming language is much more expressive than jq.

    Try the following snippet with python3.

    import json
    import pprint
    import urllib.request
    from urllib.parse import urlparse
    import os
    
    def get_tlds():
        f = urllib.request.urlopen("https://publicsuffix.org/list/effective_tld_names.dat")
        content = f.read()
        lines = content.decode('utf-8').split("n")
        # remove comments
        tlds = [line for line in lines if not line.startswith("//") and not line == ""]
        return tlds
    
    def extract_domain(url, tlds):
        # get domain
        url = url.replace("http://", "").replace("https://", "")
        url = url.split("/")[0]
        # get tld/sld
        parts = url.split(".")
        suffix1 = parts[-1]
        sld1 = parts[-2]
        if len(parts) > 2:
            suffix2 = ".".join(parts[-2:])
            sld2 = parts[-3]
        else:
            suffix2 = suffix1
            sld2 = sld1
        # try the longger first
        if suffix2 in tlds:
            tld = suffix2
            sld = sld2
        else:
            tld = suffix1
            sld = sld1
        return sld + "." + tld
    
    def clean(site, tlds):
        site["domains"] = list(set([extract_domain(url, tlds) for url in site["domains"]]))
        return site
    
    if __name__ == "__main__":
        filename = "Hzt9bq2a.json"
    
        cache_path = "tlds.json"
        if os.path.exists(cache_path):
            with open(cache_path, "r") as f:
                tlds = json.load(f)
        else:
            tlds = get_tlds()
            with open(cache_path, "w") as f:
                json.dump(tlds, f)
        
        with open(filename) as f:
            d = json.load(f)
            d = [clean(site, tlds) for site in d]
            pprint.pprint(d)
            with open("clean.json", "w") as f:
                json.dump(d, f)
    
    
    Login or Signup to reply.
  4. One of the solutions presented on this page offers that:

    A programming language is much more expressive than jq.

    It may therefore be worthwhile pointing out that jq is an expressive, Turing-complete programming language, and that it would be as straightforward (and as tedious) to capture all the intricacies of the "Public Suffix List" using jq as any other programming language that does not already provide support for this list.

    It may be useful to illustrate an approach to the problem that passes the (revised) test presented in the Q. This approach could easily be extended in any one of a number of ways:

    def extract:
      sub("^[^:]*://";"")
      | sub("/.*$";"")
      | split(".")
      | (if (.[-1]|length) == 2 and (.[-2]|length) <= 3
         then -3 else -2 end) as $ix
      | .[$ix : ]
      | join(".") ;
    
    {name, domain: (.domains | map(extract) | unique)}
    

    Output

    {
      "name": "Visma Public",
      "domain": [
        "visma.com",
        "vismaonline.com",
        "workbox.co.uk",
        "workbox.dk"
      ]
    }
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search