
I’m trying to identify all the bots that just do keyword harvesting and (for me) useless SEO selling – like MJ12bot or AhrefsBot. My site is something like 24 years old, most of the time under WordPress, but I tried MediaWiki at some point etc., so I get a lot of 404s.

I have an Apache-Varnish-WordPress stack and I’m using Varnish to stop unwanted bots. The bot.vcl looks like this:

sub bad_bot_detection {
    if (
        req.http.user-agent ~ "Daum"
        || req.http.user-agent ~ "MJ12bot"
        ...
    ) {
        return(synth(403, "Forbidden Bots"));
    } elseif (
        req.http.user-agent ~ "APIs-Google"
        || req.http.user-agent ~ "Mediapartners-Google"
        ...
    ) {
        return(pipe);
    } else {
        unset req.http.User-Agent;
    }
}
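For context, a custom subroutine like this only takes effect once it is called from `vcl_recv`. A minimal sketch of the assumed wiring (the poster doesn’t show their `vcl_recv`, so this is an illustration, not their actual config):

```
vcl 4.0;

# Call the bot check early, before other request handling,
# so bad bots are rejected before they reach cache or backend.
sub vcl_recv {
    call bad_bot_detection;
}
```

A `return(synth(...))` or `return(pipe)` inside the called sub then terminates `vcl_recv` with that action.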

At the backend, WordPress 404 monitoring is done by the Rank Math SEO plugin. I’m using return(pipe); for “good bots” just to get their User-Agent through. Otherwise I don’t know when I should fix a 404 and when I can just ignore it. Humans aren’t an issue, because if they get a 404 there is a referer. So, I would like to find out the user agents of bots so I could offer them a nice 403 error.

I googled a lot, and all the hits about Varnish and user agents are either about how to serve a different cache to mobiles, or tips on why Vary: User-Agent is bad for the cache hit ratio. Some articles advised using the Apache2 log, but that didn’t help much because of unset req.http.User-Agent; in default.vcl. I know all that, but I’m just trying to pass the user agent name through to the 404 monitoring without Varnish itself varying on it.

Maybe I should use Varnish’s own logging, but I couldn’t find the user agents there either.

So, should I just learn to live with a lot of 404s, or copy&paste all those “bad bot lists”? Can I use Varnish for hunting down the names of bots producing 404s at all?

EDIT: there is a language barrier (it would be much easier if you learned some Finnish 😉 ), so let’s look at some screenshots.

This is what I get when Varnish cleans up the user agent:
[screenshot: 404 log without user agents]

This is what I need, but without separate “cache buckets” per user agent:
[screenshot: 404 log showing user agents]

(EDIT &) CLOSING (for now anyway)

It is impossible to send the User-Agent to backend apps like WordPress without using return(pipe);, which is quite a bad idea with Varnish.

Using return(pass); doesn’t work either, because it only bypasses the cache but still does everything else, like removing Vary: User-Agent – and anyway, even if it worked, just bypassing the cache to send the User-Agent to the 404 monitor would be a bad idea.

That is a bit frustrating, actually. 404s from useless bots don’t need any fixing, but Googlebot/Bingbot/etc. do, and now I can’t tell who is who. So I made a bad-bot.vcl to stop known bad bots/SEO harvesters and let Google/Bing/etc. go through Varnish using return(pipe), so their User-Agent reaches the 404 monitor. I may (or may not) lose some SEO now because of slightly slower loading times, but that isn’t a big problem.

2 Answers


  1. Chosen as BEST ANSWER

    The answer was quite easy.

    1. Normalize the User-Agent at Varnish or not, but don't unset req.http.User-Agent
    2. Use this:
    sub vcl_backend_response {
        # Strip User-Agent from the Vary header so responses
        # are not cached in separate buckets per user agent.
        if (beresp.http.Vary ~ "User-Agent") {
            set beresp.http.Vary = regsuball(beresp.http.Vary, ",? *User-Agent *", "");
            set beresp.http.Vary = regsub(beresp.http.Vary, "^, *", "");
            if (beresp.http.Vary == "") {
                unset beresp.http.Vary;
            }
        }
    }

    Now Varnish won't cache per User-Agent (verified by Age and cache hits), and the backend, here WordPress, still gets the User-Agent. Now it is much easier to see right away whether a 404 comes from a real user or a bot.

    And normalizing the User-Agent makes reading easier, but that is a totally different story.
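    A minimal sketch of what such normalization could look like in `vcl_recv` (the patterns and labels here are illustrative assumptions, not the poster's actual rules):

    ```
    # Hypothetical User-Agent normalization: collapse each bot family
    # to one canonical name so logs and 404 reports are easy to scan.
    sub vcl_recv {
        if (req.http.User-Agent ~ "Googlebot") {
            set req.http.User-Agent = "Googlebot";
        } elseif (req.http.User-Agent ~ "bingbot") {
            set req.http.User-Agent = "bingbot";
        } elseif (req.http.User-Agent ~ "(?i)bot|crawler|spider") {
            set req.http.User-Agent = "other-bot";
        }
    }
    ```

    Since the Vary: User-Agent stripping above keeps Varnish from caching per user agent anyway, normalization here only serves readability.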


  2. If you return pipe, from that moment on Varnish won’t see anything else of the piped transaction.
    Try to return pass instead, and check 404 responses using varnishlog.

    i.e. varnishlog -d -q "RespStatus > 400" | grep -i "user-agent"
