
Lately I've been dealing with a data-mining bot that extracts data from my website every day at certain hours. This not only wastes my bandwidth but also feeds wrong data into my Google Analytics.

They usually come in from amazonaws IPs, but lately they've switched to another host.

What stays constant is that they use the same user agent. Is there a way to block by user agent? I've tried the following, but it failed. Hopefully someone can shed some light on this.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu HeadlessChrome HeadlessChrome Safari/537.36
RewriteRule .* - [R=503,L]

Update:
This is my updated .htaccess, for future reference, in case it helps the community see how it should look. Thanks MrWhite.

<LocationMatch .*>
  <IfModule mod_security2.c>
    SecRuleRemoveById 211170
    SecRuleRemoveById 211180    
  </IfModule>
</LocationMatch>


Options +FollowSymlinks

Options -Indexes

<FilesMatch "(?i)((.tpl|.ini|.log|(?<!robots).txt))">
 Require all denied
</FilesMatch>

# SEO URL Settings
RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} "=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu HeadlessChrome HeadlessChrome Safari/537.36"
RewriteRule .* - [F]

RewriteBase /
RewriteRule ^sitemap.xml$ index.php?route=extension/feed/google_sitemap [L]
RewriteRule ^googlebase.xml$ index.php?route=extension/feed/google_base [L]
RewriteRule ^system/download/(.*) index.php?route=error/not_found [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !.*.(ico|gif|jpg|jpeg|png|js|css)
RewriteRule ^([^?]*) index.php?_route_=$1 [L,QSA]

<Files 403.shtml>
order allow,deny
allow from all
</Files>

2 Answers


  1. RewriteCond %{HTTP_USER_AGENT} Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu HeadlessChrome HeadlessChrome Safari/537.36
    

    Spaces are delimiters in Apache config files. So you probably got an error about invalid flags (if you check the error log – the browser will likely just report a 500 error). You either need to backslash escape the spaces in the user-agent string, or enclose the entire user-agent (ie. CondPattern – 2nd argument to the RewriteCond directive) in double quotes. Also note that this is a regex by default, so any special/meta regex characters also need to be escaped (that includes ., ( and )).

    For example, try the following instead:

    RewriteCond %{HTTP_USER_AGENT} "^Mozilla/5\.0 \(X11; Linux x86_64\) AppleWebKit/537\.36 \(KHTML, like Gecko\) Ubuntu HeadlessChrome HeadlessChrome Safari/537\.36$"
    RewriteRule .* - [F]
    

    This will return a 403 Forbidden instead of a 503 Service Unavailable (which is really a temporary status).
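
    As mentioned above, the other option is to backslash-escape the spaces (and the regex metacharacters) instead of quoting the CondPattern. A sketch of that equivalent form:

    RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(X11;\ Linux\ x86_64\)\ AppleWebKit/537\.36\ \(KHTML,\ like\ Gecko\)\ Ubuntu\ HeadlessChrome\ HeadlessChrome\ Safari/537\.36$
    RewriteRule .* - [F]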

    Alternatively, to perform a lexicographical string comparison (exact match), instead of a regex, you can use the = prefix operator on the CondPattern. For example:

    RewriteCond %{HTTP_USER_AGENT} "=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu HeadlessChrome HeadlessChrome Safari/537.36"
    

    The CondPattern is now treated as an ordinary string (not a regex) so there is no need to escape special characters.

    Needless to say, this should go at the top of your .htaccess file – together with any other blocking directives.


    UPDATE:

    If mod_rewrite directives are being overridden (perhaps from a .htaccess file in a subdirectory) then you can use a combination of mod_setenvif and mod_authz_core (Apache 2.4+), something like:

    BrowserMatch "^Mozilla/5\.0 \(X11; Linux x86_64\) AppleWebKit/537\.36 \(KHTML, like Gecko\) Ubuntu HeadlessChrome HeadlessChrome Safari/537\.36$" block_it
    <RequireAll>
        Require all granted
        Require not env block_it
    </RequireAll>
    

    As noted above, this is Apache 2.4+ syntax.
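
    For reference, on an older Apache 2.2 server the equivalent access control would use the old Order/Allow/Deny directives instead of Require (this variant is my own sketch, not part of the original answer):

    # (keep the same BrowserMatch line as above)
    Order Allow,Deny
    Allow from all
    Deny from env=block_it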

  2. A simpler and more generic approach is to use the following, which blocks all "Headless" requests. (I am not aware of any genuine, non-suspicious human requests made with the "Headless" string in the user agent, so AFAIK it is safe to block them altogether.)

    RewriteCond %{HTTP_USER_AGENT} (HeadlessChrome) [NC]
    RewriteRule .* - [F]
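
    If other self-identifying headless or automation clients show up later, the same rule can be extended with alternation; the extra tokens below are hypothetical examples, not taken from the question:

    RewriteCond %{HTTP_USER_AGENT} (HeadlessChrome|PhantomJS|SlimerJS) [NC]
    RewriteRule .* - [F]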
    