Recently I saw a site’s robots.txt as follows:
User-agent: *
Allow: /login
Allow: /register
I could find only Allow entries and no Disallow entries.
From this, I understood that robots.txt is essentially a blacklist file that uses Disallow to block pages from being crawled, and that Allow is used only to permit a sub-part of a domain that is already blocked with Disallow, similar to this:
Allow: /crawlthis
Disallow: /
But that robots.txt has no Disallow entries. So, does this robots.txt let Google crawl all the pages, or does it allow only the specified pages listed with Allow?
2 Answers
You are right that this robots.txt file allows Google to crawl all the pages on the website. A thorough guide can be found here: http://www.robotstxt.org/robotstxt.html.
If you want Googlebot to be allowed to crawl only the specified pages, the format would be something like this, blocking everything with a blanket Disallow and then adding Allow exceptions for those paths:
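User-agent: Googlebot
Disallow: /
Allow: /login
Allow: /register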
(I would normally disallow those specific pages, though, as they don't provide much value to searchers.)
It's important to note that the Allow directive only works with some robots (including Googlebot).
There is no point in having a robots.txt record that has Allow lines but no Disallow lines; everything is allowed to be crawled by default anyway.
According to the original robots.txt specification (which doesn't define Allow), such a record is even invalid, as at least one Disallow line is required: "At least one Disallow field needs to be present in a record."
In other words, a record like
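User-agent: *
Allow: /login
Allow: /register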
is equivalent to the record
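User-agent: *
Disallow: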
i.e., everything is allowed to be crawled, including (but not limited to) URLs with paths that start with /login and /register.