
I have a bash one-liner that finds the clients with the most hits in my server logs and prints each one's hit count, IP address and user agent:

cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '{print $5 $11}' | sort -n | uniq -c | sort -nr | head -30

It prints out a result like this:

COUNT   IP Address  User Agent

37586  66.249.73.223  "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
23960  84.132.153.226  "-" <--- I do need to see things like this
13246  17.58.103.219  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)" <--- But not this
10572  66.249.90.191  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246 Mozilla/5.0"
 9505  66.249.73.223  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
 5157  66.249.73.193  "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

I am not concerned with bots such as Googlebot, Bingbot, Applebot, etc. Is there a way I can get the same format, but excluding these friendly bots?

I am able to exclude Googlebot with:

cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '{print $5 $11}' | grep -v "Googlebot" | sort -n | uniq -c | sort -nr | head -30

But I would like to exclude multiple bots.

I also did:

cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '{print $5 $11}' | grep -v "Googlebot" | grep -v "bingbot" | grep -v "Applebot" | sort -n | uniq -c | sort -nr | head -30

which seems to work, but is that proper bash syntax to pipe several greps?

2 Answers


  1. Chosen as BEST ANSWER

    I found a much cleaner way to do it than chaining multiple 'grep -v' calls. I used egrep (equivalent to grep -E, which newer versions of GNU grep prefer):

    cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '{print $5 $11}' | egrep -v "Googlebot|bingbot|Applebot" | sort -n | uniq -c | sort -nr | head -30
    

    Unless someone has a better way, this works perfectly for me.
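A minimal sketch of what the alternation does, run on made-up sample lines (the IPs and user agents below are invented for illustration, and `sample_lines` and `filter_bots` are hypothetical helper names). It uses `grep -E`, which behaves the same as `egrep`:

```shell
# Hypothetical sample of "IP user-agent" lines, standing in for the awk output.
sample_lines() {
  cat <<'EOF'
192.0.2.10 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.20 "-"
192.0.2.30 "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
192.0.2.40 "curl/7.58.0"
EOF
}

# -E enables extended regular expressions, so | means "or";
# -v inverts the match and keeps only lines that contain none of the names.
filter_bots() {
  grep -E -v 'Googlebot|bingbot|Applebot'
}

sample_lines | filter_bots
```

Only the `"-"` and `curl` lines should survive. The match is case-sensitive, so `Bingbot` would not be removed unless you add `-i`.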


  2. You can also use grep -F -v -e <phrase1> -e <phrase2> ... -e <phraseN>
    as in the following:

    cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '{print $5 $11}' | grep -F -v -e "Googlebot" -e "bingbot" -e "Applebot" | sort -n | uniq -c | sort -nr | head -30
    

    -F tells grep to treat each search string as a fixed string rather than a regular expression … this is usually faster than regex matching, and characters such as . or + need no escaping

    -e specifies a pattern. With multiple -e flags, a line matches if it contains any of the patterns, so a single grep command covers all of them.
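To illustrate the difference -F makes, here is a small sketch on invented user-agent strings (`agents`, `fixed_filter` and `regex_filter` are hypothetical names). With -F the dot in `2.1` is a literal dot; without it, the dot is a regex wildcard:

```shell
# Made-up user-agent strings for illustration only.
agents() {
  cat <<'EOF'
Googlebot/2.1
Googlebot/211
bingbot/2.0
curl/7.58.0
EOF
}

# Fixed strings: only the literal text "Googlebot/2.1" (and "bingbot") is excluded.
fixed_filter() {
  grep -F -v -e 'Googlebot/2.1' -e 'bingbot'
}

# Regular expressions: "." matches any character, so "Googlebot/211" is excluded too.
regex_filter() {
  grep -v -e 'Googlebot/2.1' -e 'bingbot'
}

echo "fixed:"; agents | fixed_filter
echo "regex:"; agents | regex_filter
```

For plain names like `Googlebot` the two modes select the same lines; the distinction only matters once a phrase contains regex metacharacters.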

    Alternatively, you can use a “blacklist” file, and do something like the following:

    cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '{print $5 $11}' | grep -F -f blacklist.txt -v | sort -n | uniq -c | sort -nr | head -30
    

    where the contents of blacklist.txt are:

    Applebot
    Googlebot
    bingbot
    

    … the benefit to this is that when you find a new entry you want to ignore, you can just add it to the blacklist instead of modifying your script … it’s also quite readable.
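One thing to watch with a pattern file: a blank line is an empty pattern, which matches every line, so combined with -v it hides all output. A hedged sketch of a wrapper that strips blank lines first (`filter_with_blacklist` is a hypothetical helper, and the sample data is invented):

```shell
# Filter stdin against a blacklist file, ignoring blank lines in that file.
# Usage: some_command | filter_with_blacklist blacklist.txt
filter_with_blacklist() {
  bl_cleaned=$(mktemp)
  # Copy the blacklist without empty lines into a temporary file.
  grep -v '^$' "$1" > "$bl_cleaned"
  # -F: fixed strings, -v: invert match, -f: read patterns from the file.
  grep -F -v -f "$bl_cleaned"
  bl_status=$?
  rm -f "$bl_cleaned"
  return "$bl_status"
}

# Demo with a blacklist that contains a stray blank line.
demo_blacklist=$(mktemp)
printf 'Applebot\nGooglebot\n\nbingbot\n' > "$demo_blacklist"
printf '%s\n' 'Googlebot/2.1' 'curl/7.58.0' 'bingbot/2.0' | filter_with_blacklist "$demo_blacklist"
rm -f "$demo_blacklist"
```

The demo should print only the `curl` line; passing the uncleaned file straight to `grep -F -v -f` would print nothing.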

    edit: Because you’re already using awk, you could get rid of grep altogether (mind you, at the cost of using regexes, but since awk is already processing every line in the file, you might save time by dropping a process from the pipeline). The second sort has to stay: the first sort only groups identical lines so that uniq -c can count them (a plain sort is enough for that), and the final sort -nr is what orders the results by hit count:

    cat /var/log/apache2/proxy.example.com.access.log | awk -F'|' '!/Applebot|Googlebot|bingbot/{print $5 $11}' | sort | uniq -c | sort -nr | head -30
    

    I would also suggest getting rid of the leading cat, because awk can read the file directly when you pass it as an argument:

    awk -F'|' '!/Applebot|Googlebot|bingbot/{print $5 $11}' /var/log/apache2/proxy.example.com.access.log | sort | uniq -c | sort -nr | head -30
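The awk filter above tests the whole log line, not just the user agent. A minimal sketch, assuming the pipe-delimited layout from the question (field 5 is the IP, field 11 the user agent), that tests only the user-agent field and lets awk do the counting as well. `sample_log` and `top_hits` are hypothetical names, and the log lines are made up:

```shell
# Made-up log lines in an assumed 11-field, pipe-delimited layout.
sample_log() {
  cat <<'EOF'
a|b|c|d|192.0.2.10|f|g|h|i|j|"Googlebot/2.1"
a|b|c|d|192.0.2.20|f|g|h|i|j|"-"
a|b|c|d|192.0.2.20|f|g|h|i|j|"-"
a|b|c|d|192.0.2.40|f|g|h|i|j|"curl/7.58.0"
EOF
}

# Count hits per "IP user-agent" pair, skipping the listed bots.
# Reads the files given as arguments, or stdin when there are none.
top_hits() {
  awk -F'|' '
    # Test only the user-agent field, then tally the pair in an array.
    $11 !~ /Applebot|Googlebot|bingbot/ { count[$5 " " $11]++ }
    # Print "count key" for every pair seen.
    END { for (key in count) print count[key], key }
  ' "$@" | sort -nr | head -30
}

sample_log | top_hits
```

On a real log this would be called as `top_hits /var/log/apache2/proxy.example.com.access.log`. The counts are not padded the way `uniq -c` pads them, so the columns line up slightly differently.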
    

    and since you know the location of the fields, you could also use sed, which may be faster than awk … I leave that as an exercise to the reader (just keep in mind capture groups and back-references: ls | sed -n 's/\(.*\)\.txt/\1/p' prints all the ‘*.txt’ file names without their file extension)
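As a rough sketch of how capture groups and back-references could be applied here, assuming the same 11-field, pipe-delimited layout as in the question (`strip_txt` and `extract_ip_agent` are hypothetical names; the input is invented). Whether this is actually faster than awk on a given log is something to measure rather than assume:

```shell
# Print names ending in .txt without the extension; other names are dropped.
strip_txt() {
  sed -n 's/\(.*\)\.txt$/\1/p'
}

# Pull out field 5 and field 11 of a pipe-delimited line.
# \([^|]*|\)\{4\} skips four fields, \2 captures the fifth,
# \([^|]*|\)\{5\} skips fields 6-10, and \4 captures the eleventh.
extract_ip_agent() {
  sed -n 's/^\([^|]*|\)\{4\}\([^|]*\)|\([^|]*|\)\{5\}\([^|]*\).*$/\2 \4/p'
}

printf '%s\n' 'notes.txt' 'image.png' | strip_txt
printf '%s\n' 'a|b|c|d|192.0.2.20|f|g|h|i|j|"-"' | extract_ip_agent
```

The bot filtering would still need a separate step, for example a `/Googlebot/d` command or one of the grep variants shown earlier.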
