
Is there any bash command/script in Linux that can extract the active domains from a long list?

For example, I have a CSV file (domains.csv) with 55 million domains listed one per line; we need only the active domains written to another CSV file (active.csv).

Here "active" means a domain that has at least a web page; it is not about whether the domain is expired. For example, whoisdatacenter.info is not expired, but it has no web page, so we consider it non-active.

I checked Google and Stack sites. I saw that we can check a domain in two ways, like:

$ curl -Is google.com | grep -i location 
Location: http://www.google.com/

or 

nslookup google.com | grep -i name 
Name:   google.com

but I have no idea how to write a bash program that does this for 55 million domains.

The commands below give no result, which made me conclude that nslookup and curl are the way to detect active domains:

$ nslookup whoisdatacenter.info | grep -i name 
$ curl -Is whoisdatacenter.info | grep -i location 

First 25 lines:

$ head -25 domains.csv 

"
"0----0.info"
"0--0---------2lookup.com"
"0--0-------free2lookup.com"
"0--0-----2lookup.com"
"0--0----free2lookup.com"
"0--1.xyz"
"0--123456789.com"
"0--123456789.net"
"0--6.com"
"0--7.com"
"0--9.info"
"0--9.net"
"0--9.world"
"0--a.com"
"0--a.net"
"0--b.com"
"0--m.com"
"0--mm.com"
"0--reversephonelookup.com"
"0--z.com"
"0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0.com"
"0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0.com"
"0-0-0-0-0-0-0-0-0-0-0-0-0-10-0-0-0-0-0-0-0-0-0-0-0-0-0.info"

Code I am running:

while read line; 
do nslookup "$line" | awk '/Name/'; 
done < domains.csv > active3.csv

The result I am getting:

 sh -x ravi2.sh 
+ read line
+ nslookup ''
+ awk /Name/
nslookup: '' is not a legal name (unexpected end of input)
+ read line
+ nslookup '"'
+ awk /Name/
+ read line
+ nslookup '"0----0.info"'
+ awk /Name/
+ read line
+ nslookup '"0--0---------2lookup.com"'
+ awk /Name/
+ read line
+ nslookup '"0--0-------free2lookup.com"'
+ awk /Name/
+ read line
+ nslookup '"0--0-----2lookup.com"'
+ awk /Name/
+ read line
+ nslookup '"0--0----free2lookup.com"'
+ awk /Name/

Still, active3.csv is empty.
The script below works, but something is stopping the bulk lookup; it is either my host or something else.

while read line
do
nslookup $(echo "$line" | awk '{gsub(/\r/,"");gsub(/^"|"$/,"")} 1') | awk '/Name/{print}'
done < input.csv >> output.csv

The bulk nslookup shows an error like the one below:

server can't find facebook.com13: NXDOMAIN
[Solved] Ravi's script is working perfectly fine. I was running it on my Mac, which gave the nslookup error; on the CentOS Linux server where I work, nslookup works great with Ravi's script.

Thanks a lot!!

2 Answers


  1. EDIT: Please try my EDIT solution, as per the OP's shown samples.

    while read line
    do
       # strip any DOS carriage return and the surrounding double quotes from the
       # input line, then print the address reported after the Name line
       nslookup $(echo "$line" | awk '{gsub(/\r/,"");gsub(/^"|"$/,"")} 1') | awk '/Name/{found=1;next} found && /Address/{print $NF}'
    done < "Input_file"
    


    Could you please try the following.

    The OP has control-M characters in the Input_file, so first run the following command to remove them:

    tr -d '\r' < Input_file > temp && mv temp Input_file
    

    Then run the following code:

    while read line
    do
       nslookup "$line" | awk '/Name/{found=1;next} found && /Address/{print $NF}'
    done < "Input_file"
    

    I am assuming that since you are passing domain names, you want their addresses (IP addresses) in the output. Also, since your Input_file is huge, it may be a bit slow in producing output, but trust me, this is a simpler way.
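
    If the plain loop turns out to be slow for 55 million rows, a rough sketch of running the same lookup in parallel with xargs (my addition, not part of the original answer; it assumes GNU xargs with -P is available and that the carriage returns and quotes have already been stripped as shown above) could look like:

    # run 64 lookups at a time; xargs hands each child one domain as $1
    # (output lines from parallel children may interleave)
    xargs -P 64 -n 1 sh -c '
       nslookup "$1" | awk "/Name/{found=1;next} found && /Address/{print \$NF}"
    ' _ < Input_file > output.csv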

  2. nslookup simply indicates whether or not the domain name has a record in DNS. Having one or more IP addresses does not automatically mean you have a web site; many IP addresses are allocated for different purposes altogether (but might coincidentally host a web site for another domain name entirely!)

    (Also, nslookup is not particularly friendly to scripting; you will want to look at dig instead for automation.)
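
    As a rough illustration of that kind of dig-based check (my addition, not part of the original answer; the helper name is made up, and dig from bind-utils/dnsutils is assumed to be installed), an empty +short result means the name did not resolve:

    # hypothetical helper: print the domain only if it has an A record
    resolves() {
       [ -n "$(dig +short A "$1")" ] && printf '%s\n' "$1"
    }
    resolves google.com    # prints google.com when the name resolves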

    There is no simple way to visit 55 million possible web sites in a short time, and probably you should not be using Bash if you want to. See e.g. https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html for an exposition of various approaches based on Python.

    The immediate error message indicates that you have DOS carriage returns in your input file; this is a common FAQ which is covered very well over at Are shell scripts sensitive to encoding and line endings?
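
    One quick way to confirm that (my addition, not part of the original answer): cat -v makes the carriage returns visible as ^M at the end of each line.

    # show the first few lines with control characters made visible
    head -3 domains.csv | cat -v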

    You can run multiple curl instances in parallel, but you will probably saturate your network eventually; experiment with various degrees of parallelism. You could also split your file into smaller pieces and run each piece on a separate host with a separate network connection (perhaps in the cloud). But to quickly demonstrate,

    tr -d '\r' <file |
    xargs -P 256 -i sh -c 'curl -Is {} | grep Location'
    

    to run 256 instances of curl in parallel. You will still need to figure out which output corresponds to which input, so maybe refactor to something like

    tr -d '\r' <file |
    xargs -P 256 -i sh -c 'curl -Is {} | sed -n "s/Location/{}:&/p"'
    

    to print the input domain name in front of each output.

    (Maybe also note that just a domain name is not a complete URL. curl will helpfully attempt to add a “http://” in front and then connect to that, but that still doesn’t give you an accurate result if the domain only has a “https://” website and no redirect from the http:// one.)
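
    A minimal sketch of checking both schemes explicitly (my addition, not part of the original answer; the helper name and the 10-second timeout are arbitrary choices):

    # hypothetical check: succeed if either the https:// or the http:// URL answers at all
    has_website() {
       curl -I -s -o /dev/null --max-time 10 "https://$1" ||
       curl -I -s -o /dev/null --max-time 10 "http://$1"
    }
    has_website example.com && echo "example.com answers over HTTP or HTTPS"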

    If you are on a Mac, where xargs doesn’t understand -i, try -I {} or something like

    tr -d '\r' <file |
    xargs -P 256 sh -c 'for url; do curl -Is "$url" | sed -n "s/Location/$url:&/p"; done' _
    

    The examples assume you didn’t already fix the DOS carriage returns once and for all; you probably really should (and consider dropping Windows from the equation entirely).
