I am trying to block certain user agents from accessing the search pages, mostly bots and crawlers, as they end up increasing the CPU usage.
Using the htaccess rewrite engine, of course. I currently have this (I have been trying with a lot of different combinations of rules that I found on SO and other places):
# Block user agents
ErrorDocument 503 "Site temporarily disabled for crawling"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^.*(bots).*$ [NC]
# RewriteCond %{QUERY_STRING} ^s=(.*)$
# RewriteCond /shop(?:.php)??s=([^s&]+)
RewriteCond %{QUERY_STRING} !(^|&)s=*
RewriteCond %{REQUEST_URI} !^/robots.txt$
# RewriteRule ^shop*$ /? [R=503,L]
RewriteRule ^shop$ ./$1 [R=503,L]
Sorry about the many commented-out lines – as I mentioned, I have been trying a lot of different things, but it appears that htaccess rewrite rules are not my cup of tea.
What I want to do is: if the user agent contains "bot", then return a 503 error. The conditions are
- User agent contains "bots" – this part is working fine, I tested it
- There is an s query string, with anything in it
- It’s not the robots.txt URL (at this point I think I should remove this condition, it’s not even needed)
- Finally, if the above matches, redirect /shop/?s= or /shop?s= to root and serve the 503 error document.
3 Answers
Since you clearly defined the criteria to decide by, it is straightforward to implement them. I understand your question to mean that all of those criteria have to be fulfilled.
Not sure why you test for "bots" instead of "bot", though (your question contradicts itself on that point).
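All of the criteria combined could look something like this (a sketch, assuming the .htaccess file sits in the document root and the search lives under /shop):

ErrorDocument 503 "Site temporarily disabled for crawling"
RewriteEngine On

# user agent contains "bot" (this also covers "bots", since it is a substring)
RewriteCond %{HTTP_USER_AGENT} bot [NC]
# an "s" query string parameter is present, with or without a value
RewriteCond %{QUERY_STRING} (^|&)s=
# not robots.txt – kept only because your list mentions it, it is redundant here
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# request for /shop or /shop/ – serve the 503 error document, no substitution
RewriteRule ^shop/?$ - [R=503,L]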
With your shown samples/attempts, please try the following htaccess rules. Please make sure to clear your browser cache before testing your URLs.
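For instance, matching the path and query string together via THE_REQUEST (the patterns here are only a sketch – adjust them to your URL structure):

RewriteEngine On

# user agent contains "bot"
RewriteCond %{HTTP_USER_AGENT} bot [NC]
# the request line contains /shop?s= or /shop/?s=
RewriteCond %{THE_REQUEST} \s/shop/?\?s= [NC]
# answer any such request with a 503
RewriteRule ^ - [R=503,L]

If the s parameter can appear anywhere in the query string (not only first), test %{QUERY_STRING} against (^|&)s= instead.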
You can’t "redirect to root" and "serve 503 error document". A redirect is a 3xx response. You could internally rewrite the request to root and send a 503 HTTP response status by defining ErrorDocument 503 /. However, that rather defeats the point of a 503 and does not help "CPU usage". Just serving the static string for the 503 response, as you have already defined, would seem to be the best option.

When you set a code other than one in the 3xx range, the substitution string (ie. ./$1) is completely ignored. You should simply include a hyphen (-) as the substitution string to explicitly indicate "no substitution". The L flag is also superfluous here. When specifying a non-3xx return code, the L flag is implied.

Agreed, the robots.txt check is entirely superfluous. You are already checking that the request is /shop, so it cannot possibly be /robots.txt as well. (Any request for /robots.txt does not include a query string either.)
For some reason you’ve negated (the ! prefix) the condition (perhaps in an attempt to make it work?) – but you need this to be a positive match for the s URL parameter. Note that the regex (^|&)s=* is incorrect. The trailing * matches the preceding character (the =) 0 or more times, so the = is effectively optional – it matches just s, s=, s==, etc., and would also match any other parameter whose name merely starts with s (eg. a sort parameter).

To match the s URL parameter with "anything in it" (including nothing) you simply need to remove the trailing *, eg. (^|&)s=. To match the s URL param with something as the value, then also match a single character except &, eg. (^|&)s=[^&].

"bots" or "bot", as mentioned above? I can’t imagine "bots" matching many bots, but "bot" will match a lot!
This regex is rather complex for what it does. It also unnecessarily captures the word "bots" (which isn’t being used later). The regex ^.*(bots).*$ is the same as simply bots (without the capturing group).

Taking the above points into consideration, we have:
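For example (a sketch, assuming the .htaccess file is in the document root, next to the ErrorDocument 503 and RewriteEngine On lines you already have):

# user agent contains "bot" (which also matches "bots")
RewriteCond %{HTTP_USER_AGENT} bot [NC]
# the "s" URL parameter is present, with or without a value
RewriteCond %{QUERY_STRING} (^|&)s=
# request for shop or shop/ – no substitution (-), and no L flag needed for a non-3xx code
RewriteRule ^shop/?$ - [R=503]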
The above rule does the following:
- applies only to a request for shop or shop/
- requires that s= is present (anywhere in the query string)

However, I would query whether a 503 is really the correct response. Generally, you don’t want bots to crawl internal search results at all; ever. It can be a bottomless pit and wastes crawl budget. Should this perhaps be a 403 instead?
And, are you blocking these URLs in robots.txt already? (And this isn’t sufficient to stop the "bad" bots?) If not, I would consider adding the following:
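Something along these lines in your robots.txt (assuming the search only lives under /shop):

User-agent: *
Disallow: /shop?s=
Disallow: /shop/?s=
# crawlers that support wildcards (eg. Googlebot, Bingbot) also accept: Disallow: /*?s=

Well-behaved bots treat the Disallow value as a URL prefix, so this covers any value of s; the bots that ignore robots.txt are what the 503/403 rule above is for.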