
I’m using the following IIS Rewrite Rule to block as many bots as possible.

<rule name="BotBlock" stopProcessing="true">
  <match url=".*" />
  <conditions>
    <add input="{HTTP_USER_AGENT}" pattern="^$|\b(?!.*googlebot.*\b)\w*(?:bot|crawl|spider)\w*" />
  </conditions>
  <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" />
</rule>

The goal is to block every user agent that contains bot, crawl, or spider, while still allowing Googlebot. This works to an extent, but the second part of the regex still matches even when "googlebot" appears in the string.

Below are some examples of what I mean:

 Googlebot/2.1 (+http://www.google.com)

Works fine: the ‘bot’ part in googlebot is ignored and the request is permitted.

 Googlebot/2.1 (+http://www.google.com/bot.html)

Does not work: the rule still triggers on the second ‘bot’ in the string and the request is blocked.

 KHTML, like Gecko; compatible; bingbot

Works fine: the rule triggers on the ‘bot’ in bingbot and the request is blocked.
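
The three cases above can be reproduced outside IIS with a small sketch. It assumes Python's `re` module behaves like the IIS regex engine for this pattern and that matching is case-insensitive; the helper name `is_blocked` is made up for illustration. The likely cause of the second result is that the negative lookahead is re-evaluated at every word boundary, and at the boundary just before the trailing "bot" there is no "googlebot" left ahead of that position.

```python
import re

# Pattern from the rule above, with the backslashes written out.
# Assumption: case-insensitive matching, approximated with re.IGNORECASE.
ORIGINAL = re.compile(
    r"^$|\b(?!.*googlebot.*\b)\w*(?:bot|crawl|spider)\w*",
    re.IGNORECASE,
)

def is_blocked(user_agent: str) -> bool:
    """Return True when the pattern matches, i.e. the rule would answer 403."""
    return ORIGINAL.search(user_agent) is not None

samples = [
    "Googlebot/2.1 (+http://www.google.com)",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "KHTML, like Gecko; compatible; bingbot",
]
for ua in samples:
    print(f"{ua!r}: {'blocked' if is_blocked(ua) else 'allowed'}")
```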

So can someone help me change the regex so that the string Googlebot/2.1 (+http://www.google.com/bot.html) is allowed?

3 Answers


  1. I’m not familiar with IIS’s exact regex flavor (presumably ASP), but this should work if you can enable case-insensitive matching:

    ^(?!.*googlebot).*(?:bot|crawl|spider)
    

    Explanation:

    • ^ – start line anchor
    • (?!.*googlebot) – ahead of me, the word "googlebot" does not exist
    • .*(?:bot|crawl|spider) – match everything up to and including an occurrence of the word "bot", "crawl", or "spider"

    The combination of the negative look-ahead and the forward match produces an implicit AND condition in the regex; both must be true for the regex to register a match.

    https://regex101.com/r/ri6Qs7/1
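
As a quick check of the pattern in this answer, here is a minimal sketch using Python's `re` module. This is an assumption: the IIS regex engine may differ in details, and `re.IGNORECASE` stands in for the case-insensitive option mentioned above. The helper name `is_blocked` is hypothetical.

```python
import re

# Pattern from this answer. Because the look-ahead is anchored at ^,
# it scans the whole string for "googlebot" exactly once.
PATTERN = re.compile(r"^(?!.*googlebot).*(?:bot|crawl|spider)", re.IGNORECASE)

def is_blocked(user_agent: str) -> bool:
    """Return True when the pattern matches, i.e. the rule would answer 403."""
    return PATTERN.search(user_agent) is not None

samples = [
    "Googlebot/2.1 (+http://www.google.com)",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "KHTML, like Gecko; compatible; bingbot",
    "",  # empty user agent
]
for ua in samples:
    print(f"{ua!r}: {'blocked' if is_blocked(ua) else 'allowed'}")
```

In this sketch both Googlebot strings are allowed and bingbot is blocked. An empty user agent is also allowed, because the pattern no longer carries the ^$| alternative.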


    Note: I am not sure why your regex starts with ^$| unless you deliberately want to return a 403 for requests with an empty user agent.

  2. The following pattern will not match when "(+http://www.google.com/" immediately precedes the "bot", so that specific string is allowed: ^$|\b(?!.*Googlebot.*\b)\w*(?<!\(\+http://www\.google\.com/)(?:bot|crawl|spider)\w*

  3. Since this question is specific to the IIS URL Rewrite Module, I’ll focus on that instead of regex.

    You don’t have to solve this with a single magical regex pattern.
    The rule may consist of multiple conditions involving lighter regex patterns.

    <rule name="BotBlock" stopProcessing="true">
        <match url=".*" />
        <!-- logicalGrouping="MatchAll": all conditions must be met. -->
        <conditions logicalGrouping="MatchAll">
            <!-- User-Agent must not (negate="true") contain "googlebot" -->
            <add input="{HTTP_USER_AGENT}" pattern="googlebot" negate="true" />
            <!-- User-Agent must contain at least one of those words -->
            <add input="{HTTP_USER_AGENT}" pattern="bot|crawl|spider" />
        </conditions>
        <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" />
    </rule>
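
The way the two conditions combine can be sketched outside IIS. This is only an approximation in Python: it assumes both conditions are matched case-insensitively and as substring searches, and the function name `is_blocked` is made up for illustration.

```python
import re

def is_blocked(user_agent: str) -> bool:
    """Mimic logicalGrouping="MatchAll": every condition must hold."""
    # Condition 1 (negate="true"): the user agent must NOT contain "googlebot".
    not_googlebot = re.search(r"googlebot", user_agent, re.IGNORECASE) is None
    # Condition 2: the user agent must contain "bot", "crawl", or "spider".
    has_bot_word = (
        re.search(r"bot|crawl|spider", user_agent, re.IGNORECASE) is not None
    )
    return not_googlebot and has_bot_word

samples = [
    "Googlebot/2.1 (+http://www.google.com)",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "KHTML, like Gecko; compatible; bingbot",
]
for ua in samples:
    print(f"{ua!r}: {'blocked' if is_blocked(ua) else 'allowed'}")
```

Splitting the check this way keeps each pattern trivial, so no look-around is needed. In this sketch an empty user agent is not blocked, since neither pattern covers that case.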
    