jq - How to extract domains and remove duplicates - Debian

JeffreyShran
July 7, 2020
190 views
2 votes
4 Answers

Given the following json:

Full file here: https://pastebin.com/Hzt9bq2a

{
    "name": "Visma Public",
    "domains": [
      "accountsettings.connect.identity.stagaws.visma.com",
      "admin.stage.vismaonline.com",
      "api.home.stag.visma.com",
      "api.workbox.dk",
      "app.workbox.dk",
      "app.workbox.co.uk",
      "authz.workbox.dk",
      "connect.identity.stagaws.visma.com",
      "eaccounting.stage.vismaonline.com",
      "eaccountingprinting.stage.vismaonline.com",
      "http://myservices-api.stage.vismaonline.com/",
      "identity.stage.vismaonline.com",
      "myservices.stage.vismaonline.com"
    ]
}

How can I transform the data to the below. Which is, to identify the domains in the format of site.SLD.TLD present and then remove the duplication of them. (Not including the subdomains, protocols or paths as illustrated below.)

{
    "name": "Visma Public",
    "domains": [
      "workbox.co.uk",
      "workbox.dk",
      "visma.com",
      "vismaonline.com"
    ]
}

I would like to do so in jq as that is what I’ve used to wrangled the data into this format so far, but at this stage any solution that I can run on Debian (I’m using bash) without any extraneous tooling ideally would be fine.

I’m aware that regex can be used within jq so I assume the best way is to regex out the domain and then pipe to unique however I’m unable to get anything working so far I’m currently trying this version which seems to me to need only the text transformation stage adding in somehow either during the jq process or with a run over with something like awk after the event perhaps:

jq '[.[] | {name: .name, domain: [.domains[]] | unique}]' testfile.json

This appears to be useful: https://github.com/stedolan/jq/issues/537

One solution was offered which does a regex match to extract the last two strings separated by . and call the unique function on that & works up to a point but doesn’t cover site.SLD.TLD that has 2 parts. Like google.co.uk would return only co.uk with this jq for example:

jq '.domains |= (map(capture("(?<x>[[:alpha:]]+).(?<z>[[:alpha:]]+)(.?)$") | join(".")) | unique)'

Tags: jq json

Answers

- hobbs
- July 7, 2020 at 4:56 pm
- 0 votes
0
Judging from your example, you don’t actually want top-level domains (just one component, e.g. ".com"), and you probably don’t really want second-level domains (last two components) either, because some domain registries don’t operate at the TLD level. Given www.foo.com.br, you presumably want to find out about foo.com.br, not com.br.

To do that, you need to consult the Public Suffix List. The file format isn’t too complicated, but it has support for wildcards and exceptions. I dare say that jq isn’t the ideal language to use here — pick one that has a URL-parsing module (for extracting hostnames) and an existing Public Suffix List module (for extracting the domain parts from those hostnames).

Login or Signup to reply.

- Dmitry
- July 8, 2020 at 3:30 pm
- 0 votes
0
May I offer you achieving the same query with jtc: the same could be achieved in other languages (and of course in jq) – the query is mostly how to come up with the regex to satisfy your ask:
```
bash $ <file.json jtc -w'<domains>l:>((?:[a-z0-9]+.)?[a-z0-9]+.[a-z0-9]+)[^.]*$<R:' -u'{{$1}}' /
                      -ppw'<domains>l:><q:' -w'[domains]:<[]>j:' -w'<name>l:'
{
   "domains": [
      "stagaws.visma.com",
      "stage.vismaonline.com",
      "stag.visma.com",
      "api.workbox.dk",
      "app.workbox.dk",
      "workbox.co.uk",
      "authz.workbox.dk"
   ],
   "name": "Visma Public"
}
bash $ 
```
Note: it does extract only DOMAIN.TLD, as per your ask. If you like to extract DOMAIN.SLD.TLD, then the task becomes a bit less trivial.

Update:

Modified solution as per the comment: extract domain.sld.tld where 3 or more levels and domain.tld where there’s only 2

PS. I’m the creator of the jtc – JSON processing utility. This disclaimer is SO requirement.
Login or Signup to reply.

A programming language is much more expressive than jq.

Try the following snippet with python3.

import json
import pprint
import urllib.request
from urllib.parse import urlparse
import os

def get_tlds():
    f = urllib.request.urlopen("https://publicsuffix.org/list/effective_tld_names.dat")
    content = f.read()
    lines = content.decode('utf-8').split("n")
    # remove comments
    tlds = [line for line in lines if not line.startswith("//") and not line == ""]
    return tlds

def extract_domain(url, tlds):
    # get domain
    url = url.replace("http://", "").replace("https://", "")
    url = url.split("/")[0]
    # get tld/sld
    parts = url.split(".")
    suffix1 = parts[-1]
    sld1 = parts[-2]
    if len(parts) > 2:
        suffix2 = ".".join(parts[-2:])
        sld2 = parts[-3]
    else:
        suffix2 = suffix1
        sld2 = sld1
    # try the longger first
    if suffix2 in tlds:
        tld = suffix2
        sld = sld2
    else:
        tld = suffix1
        sld = sld1
    return sld + "." + tld

def clean(site, tlds):
    site["domains"] = list(set([extract_domain(url, tlds) for url in site["domains"]]))
    return site

if __name__ == "__main__":
    filename = "Hzt9bq2a.json"

    cache_path = "tlds.json"
    if os.path.exists(cache_path):
        with open(cache_path, "r") as f:
            tlds = json.load(f)
    else:
        tlds = get_tlds()
        with open(cache_path, "w") as f:
            json.dump(tlds, f)
    
    with open(filename) as f:
        d = json.load(f)
        d = [clean(site, tlds) for site in d]
        pprint.pprint(d)
        with open("clean.json", "w") as f:
            json.dump(d, f)

- peak
- July 8, 2020 at 10:39 pm
- 0 votes
0
One of the solutions presented on this page offers that:

A programming language is much more expressive than jq.

It may therefore be worthwhile pointing out that jq is an expressive, Turing-complete programming language, and that it would be as straightforward (and as tedious) to capture all the intricacies of the "Public Suffix List" using jq as any other programming language that does not already provide support for this list.

It may be useful to illustrate an approach to the problem that passes the (revised) test presented in the Q. This approach could easily be extended in any one of a number of ways:
```
def extract:
  sub("^[^:]*://";"")
  | sub("/.*$";"")
  | split(".")
  | (if (.[-1]|length) == 2 and (.[-2]|length) <= 3
     then -3 else -2 end) as $ix
  | .[$ix : ]
  | join(".") ;
```
```
{name, domain: (.domains | map(extract) | unique)}
```
Output
```
{
  "name": "Visma Public",
  "domain": [
    "visma.com",
    "vismaonline.com",
    "workbox.co.uk",
    "workbox.dk"
  ]
}
```
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

jq – How to extract domains and remove duplicates – Debian

Answers

Update:

Output