I have a bash script that tallies the top hits in my server log and prints the count, IP address, and user agent:
cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '{print $5 $11}' | sort -n | uniq -c | sort -nr | head -30
It prints out a result like this:
COUNT IP Address User Agent
37586 66.249.73.223 "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
23960 84.132.153.226 "-" <--- I do need to see things like this
13246 17.58.103.219 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)" <--- But not this
10572 66.249.90.191 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246 Mozilla/5.0"
9505 66.249.73.223 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
5157 66.249.73.193 "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I am not concerned with bots such as Googlebot, Bingbot, Applebot, etc. Is there a way I can get the same format, but excluding these friendly bots?
I am able to exclude Googlebot with:
cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '{print $5 $11}' | grep -v "Googlebot" | sort -n | uniq -c | sort -nr | head -30
But I would like to exclude multiple bots.
I also did:
cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '{print $5 $11}' | grep -v "Googlebot" | grep -v "bingbot" | grep -v "Applebot" | sort -n | uniq -c | sort -nr | head -30
which seems to work, but is that proper bash syntax to pipe several greps?
2 Answers
I found a much cleaner way to do it than multiple 'grep -v' calls. I used egrep with an alternation pattern, along these lines:
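cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '{print $5 $11}' | egrep -v "Googlebot|bingbot|Applebot" | sort -n | uniq -c | sort -nr | head -30
(egrep is equivalent to grep -E; the three bot names here are the ones from my examples above)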
Unless someone has a better way, this works perfectly for me.
You can also use
grep -F -v -e <phrase1> -e <phrase2> ... -e <phraseN>
as in the following:
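For instance, plugged into the pipeline from the question:
cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '{print $5 $11}' | grep -F -v -e "Googlebot" -e "bingbot" -e "Applebot" | sort -n | uniq -c | sort -nr | head -30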
-F tells grep to treat the search strings as fixed strings … this is usually much faster than using regular expressions.
-e allows you to specify an expression. Using multiple -e flags lets you combine multiple expressions for use in a single grep command.
Alternatively, you can use a “blacklist” file and do something like the following:
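cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '{print $5 $11}' | grep -F -v -f blacklist.txt | sort -n | uniq -c | sort -nr | head -30
(-f tells grep to read the patterns, one per line, from the given file)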
where the contents of blacklist.txt are, one pattern per line (using the bot names from the question):
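Googlebot
bingbot
Applebot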
… the benefit to this is that when you find a new entry you want to ignore, you can just add it to the blacklist instead of modifying your script … it’s also quite readable.
edit: You can also move the -r argument to your first sort and avoid the second call altogether. Also, because you’re using awk, you could get rid of grep altogether (mind you, at the cost of using regexes, but since it’s already processing every line in the file, you might save more time on the I/O):
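For example, a sketch of the grep-free version (the bot names are the ones from the question; awk’s !/pattern/ drops the matching lines itself, so no grep is needed):
cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '!/Googlebot|bingbot|Applebot/ {print $5 $11}' | sort -n | uniq -c | sort -nr | head -30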
I would also suggest getting rid of the leading cat, because awk will open the file for reading without modification (unless you tell it to modify the file):
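For instance:
awk -F'|' '!/Googlebot|bingbot|Applebot/ {print $5 $11}' /var/log/apache2/proxy.example.com.access.log | sort -n | uniq -c | sort -nr | head -30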
And since you know the location of the fields, you could also use sed, which will be faster than using awk … I leave that as an exercise to the reader (just keep in mind sed’s indexed capture groups: ls | sed -n 's/\(.*\)\.txt/\1/p' prints all of the ‘*.txt’ files without their file extension).
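For reference, one possible shape of that sed version (an untested sketch, assuming the IP is field 5 and the user agent is field 11, as in the awk commands above):
sed -nE '/Googlebot|bingbot|Applebot/d; s/^([^|]*\|){4}([^|]*)\|([^|]*\|){5}([^|]*).*/\2 \4/p' /var/log/apache2/proxy.example.com.access.log | sort -n | uniq -c | sort -nr | head -30
Here the /d expression skips the bot lines, and the substitution captures fields 5 and 11 of the '|'-separated record and prints them.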