Lately I'm facing issues with a data-mining bot extracting data from my website every day at certain hours. This not only wastes my bandwidth but also feeds wrong data into my Google Analytics.
They usually come in via amazonaws IPs, but lately they've switched to another host.
What remains constant is that they use the same user agent. Is there a way to block by user agent? I've tried the following, but it failed. Hopefully someone can shed some light on this.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu HeadlessChrome HeadlessChrome Safari/537.36
RewriteRule .* - [R=503,L]
Update:
This is my updated .htaccess, for future reference in case it helps the community. Thanks MrWhite.
<LocationMatch .*>
<IfModule mod_security2.c>
SecRuleRemoveById 211170
SecRuleRemoveById 211180
</IfModule>
</LocationMatch>
Options +FollowSymlinks
Options -Indexes
<FilesMatch "(?i)(\.tpl|\.ini|\.log|(?<!robots)\.txt)">
Require all denied
</FilesMatch>
# SEO URL Settings
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu HeadlessChrome HeadlessChrome Safari/537.36"
RewriteRule .* - [F]
RewriteBase /
RewriteRule ^sitemap.xml$ index.php?route=extension/feed/google_sitemap [L]
RewriteRule ^googlebase.xml$ index.php?route=extension/feed/google_base [L]
RewriteRule ^system/download/(.*) index.php?route=error/not_found [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !.*\.(ico|gif|jpg|jpeg|png|js|css)
RewriteRule ^([^?]*) index.php?_route_=$1 [L,QSA]
<Files 403.shtml>
Require all granted
</Files>
2 Answers
Spaces are delimiters in Apache config files, so you probably got an error about invalid flags (if you check the error log – the browser will likely just report a 500 error). You either need to backslash-escape the spaces in the user-agent string, or enclose the entire user-agent (ie. the CondPattern – the second argument to the RewriteCond directive) in double quotes. Also note that the CondPattern is a regex by default, so any special/meta regex characters also need to be escaped (that includes ".", "(" and ")"). For example, try the following instead:
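The corrected rule would look something like this – a reconstruction based on the user-agent string from the question, with the pattern quoted and the regex metacharacters (dots and parentheses) escaped:

```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/5\.0 \(X11; Linux x86_64\) AppleWebKit/537\.36 \(KHTML, like Gecko\) Ubuntu HeadlessChrome HeadlessChrome Safari/537\.36$"
RewriteRule ^ - [F]
```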
This will return a 403 Forbidden instead of a 503 Service Unavailable (which is really a temporary status).
Alternatively, to perform a lexicographical string comparison (an exact match) instead of a regex, you can use the = prefix operator on the CondPattern. The CondPattern is then treated as an ordinary string (not a regex), so there is no need to escape special characters. For example:
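The exact-match variant using the = prefix operator – this is the same rule that appears in the updated .htaccess above:

```apache
RewriteCond %{HTTP_USER_AGENT} "=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu HeadlessChrome HeadlessChrome Safari/537.36"
RewriteRule ^ - [F]
```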
Needless to say, this should go at the top of your .htaccess file – together with any other blocking directives.

UPDATE:
If the mod_rewrite directives are being overridden (perhaps by a .htaccess file in a subdirectory), then you can use a combination of mod_setenvif and mod_authz_core instead. Note that this is Apache 2.4+ syntax, something like:
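A sketch of that approach – the BLOCK_BOT variable name and the "HeadlessChrome" substring match are assumptions, not from the original answer:

```apache
# Flag matching requests (mod_setenvif); the second argument is a regex
SetEnvIf User-Agent "HeadlessChrome" BLOCK_BOT

# Deny flagged requests (mod_authz_core, Apache 2.4+ only)
<RequireAll>
    Require all granted
    Require not env BLOCK_BOT
</RequireAll>
```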
A simpler and more generic approach is to block all "Headless" requests outright. (I am not aware of any genuine, human, non-suspicious requests made with "Headless" in the user-agent string, so AFAIK it is safe to block them altogether.)
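A sketch of that blanket rule, matching "Headless" anywhere in the user-agent string (the case-insensitive [NC] flag is an assumption, added to also catch variant capitalisation):

```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Headless [NC]
RewriteRule ^ - [F]
```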