Is there a bash command or script in Linux that can extract the active domains from a long list?
For example, I have a CSV file (domains.csv) with 55 million domains listed one per line, and I need only the active ones written to another CSV file (active.csv).
Here "active" means a domain that serves at least a web page, not merely a domain that is unexpired. For example, whoisdatacenter.info is not expired, but it has no web page, so I consider it non-active.
I checked Google and the Stack sites and saw two ways to probe a domain:
$ curl -Is google.com | grep -i location
Location: http://www.google.com/
or
$ nslookup google.com | grep -i name
Name: google.com
but I have no idea how to write a bash program that does this for 55 million domains.
The commands below give no result for whoisdatacenter.info, which is how I concluded that nslookup and curl are the way to tell active from non-active:
$ nslookup whoisdatacenter.info | grep -i name
$ curl -Is whoisdatacenter.info | grep -i location
The first 25 lines of the file:
$ head -25 domains.csv
"
"0----0.info"
"0--0---------2lookup.com"
"0--0-------free2lookup.com"
"0--0-----2lookup.com"
"0--0----free2lookup.com"
"0--1.xyz"
"0--123456789.com"
"0--123456789.net"
"0--6.com"
"0--7.com"
"0--9.info"
"0--9.net"
"0--9.world"
"0--a.com"
"0--a.net"
"0--b.com"
"0--m.com"
"0--mm.com"
"0--reversephonelookup.com"
"0--z.com"
"0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0.com"
"0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0.com"
"0-0-0-0-0-0-0-0-0-0-0-0-0-10-0-0-0-0-0-0-0-0-0-0-0-0-0.info"
The code I am running:
while read line;
do nslookup "$line" | awk '/Name/';
done < domains.csv > active3.csv
The result I am getting:
sh -x ravi2.sh
+ read line
+ nslookup ''
+ awk /Name/
nslookup: '' is not a legal name (unexpected end of input)
+ read line
+ nslookup '"'
+ awk /Name/
+ read line
+ nslookup '"0----0.info"'
+ awk /Name/
+ read line
+ nslookup '"0--0---------2lookup.com"'
+ awk /Name/
+ read line
+ nslookup '"0--0-------free2lookup.com"'
+ awk /Name/
+ read line
+ nslookup '"0--0-----2lookup.com"'
+ awk /Name/
+ read line
+ nslookup '"0--0----free2lookup.com"'
+ awk /Name/
Still, active3.csv is empty.
The script below works, but something is stopping the bulk lookup; the problem is either on my host or somewhere else.
while read line
do
    # strip carriage returns and the surrounding quotes before the lookup
    nslookup $(echo "$line" | awk '{gsub(/\r/,"");gsub(/^"|"$/,"")} 1') | awk '/Name/{print}'
done < input.csv >> output.csv
The bulk nslookup shows errors like the one below:
server can't find facebook.com 13: NXDOMAIN
[Solved]
Ravi's script works perfectly fine. I was running it on my Mac, which gave the nslookup error; on my CentOS Linux server, nslookup works great with Ravi's script.
Thanks a lot!!
2 Answers
EDIT: Please try my edited solution, as per the OP's samples.
Could you please try the following?
The OP has control-M characters in her Input_file, so run the following command to remove them first:
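For example, a sketch using tr (the file name is taken from the question; GNU sed -i 's/\r$//' domains.csv would also work):

tr -d '\r' < domains.csv > domains_clean.csv &&
    mv domains_clean.csv domains.csv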
Then run the following code:
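A minimal sketch along those lines, assuming the surrounding double quotes should be stripped and that you want each domain together with the address(es) nslookup reports:

while read -r line; do
    domain=${line//\"/}              # drop the surrounding double quotes
    [ -z "$domain" ] && continue     # skip blank lines
    # print "domain,IP" for each Address line that follows a Name line,
    # so the resolver's own address at the top of the output is ignored
    nslookup "$domain" 2>/dev/null |
        awk -v d="$domain" '/^Name:/{ok=1} ok && /^Address:/{print d "," $2}'
done < domains.csv > active.csv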
I am assuming that, since you are passing domain names, you need their addresses (IP addresses) in the output. Also, since you are using a huge Input_file, it may be a bit slow in providing output, but trust me, this is a simpler way.
nslookup simply indicates whether or not the domain name has a record in DNS. Having one or more IP addresses does not automatically mean you have a web site; many IP addresses are allocated for different purposes altogether (but might coincidentally host a web site for another domain name entirely!). (Also, nslookup is not particularly friendly to scripting; you will want to look at dig instead for automation.)

There is no simple way to visit 55 million possible web sites in a short time, and you probably should not be using Bash if you want to. See e.g. https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html for an exposition of various approaches based on Python.
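For the DNS side, a minimal dig sketch (assuming dig is installed; +short prints only the answer records, so empty output means the name did not resolve):

# a name resolves if `dig +short` prints any answer records
if [ -n "$(dig +short example.com)" ]; then
    echo "example.com has a DNS record"
fi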
The immediate error message indicates that you have DOS carriage returns in your input file; this is a common FAQ which is covered very well over at "Are shell scripts sensitive to encoding and line endings?"
You can run multiple curl instances in parallel, but you will probably eventually saturate your network; experiment with various degrees of parallelism, and maybe split up your file into smaller pieces and run each piece on a separate host with a separate network connection (perhaps in the cloud). But to quickly demonstrate, you can run 256 instances of curl in parallel with something like the sketch below.
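This is only a sketch, assuming GNU xargs and that the quotes and carriage returns still need stripping; --max-time keeps a dead host from stalling one of the 256 slots forever:

tr -d '"\r' < domains.csv |
    xargs -P 256 -n 1 curl -Is --max-time 10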
You will still need to figure out which output corresponds to which input, so maybe refactor to something like the sketch below to print the input domain name in front of each output.
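Again a sketch, assuming GNU xargs; the inner sh receives each domain as $1, so it can both fetch the domain and prefix every output line with it:

tr -d '"\r' < domains.csv |
    xargs -P 256 -I {} sh -c 'curl -Is --max-time 10 "$1" | sed "s/^/$1: /"' _ {}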
(Maybe also note that just a domain name is not a complete URL. curl will helpfully attempt to add an "http://" in front and then connect to that, but that still doesn't give you an accurate result if the domain only has an "https://" website and no redirect from the http:// one.)

If you are on a Mac, where xargs doesn't understand -i, try -I {}, or something like the sketch below.
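For example, a sketch that avoids -i/-I altogether by having xargs pass one domain at a time to sh as $1:

tr -d '"\r' < domains.csv |
    xargs -P 256 -n 1 sh -c 'curl -Is --max-time 10 "$1" | sed "s/^/$1: /"' _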
The examples assume you didn't already fix the DOS carriage returns once and for all; you probably really should (and consider dropping Windows from the equation entirely).