On my website I have a page for the cart, at http://www.example.com/cart, and another for the cartoons, at http://www.example.com/cartoons. What should I write in my robots.txt file to ignore only the cart page?
The cart page does not accept a trailing slash on the URL, so if I write Disallow: /cart, it will ignore /cartoon too.
I don't know whether something like /cart$ is possible and whether spider bots would parse it correctly. I don't want to force Allow: /cartoon, because there may be other pages with the same prefix.
2 Answers
You could explicitly allow and disallow both paths. The more specific (longer) rule takes precedence:
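For example, something like this (a sketch; the User-agent: * line is my addition so the record applies to all bots, and /cartoons is taken from the URL in the question):

    User-agent: *
    Allow: /cartoons
    Disallow: /cart

Here Allow: /cartoons is the longer, more specific rule, so it wins over Disallow: /cart for anything under /cartoons.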
More info is available at: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
In the original robots.txt specification, this is not possible. It neither supports Allow nor any characters with special meaning inside a Disallow value.

But some consumers support additional things. For example, Google gives a special meaning to the $ sign, where it represents the end of the URL path:
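A minimal sketch of such a rule (the User-agent: * line is my addition):

    User-agent: *
    Disallow: /cart$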
For Google, this will block /cart, but not /cartoon.

Consumers that don't give this special meaning will interpret $ literally, so they will block /cart$, but not /cart or /cartoon.

So if using this, you should specify the bots in User-agent.
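For example, a sketch that targets only Google's crawler (Googlebot is used as the example bot token here; the empty Disallow in the second record is my addition and means all other bots may crawl everything):

    User-agent: Googlebot
    Disallow: /cart$

    User-agent: *
    Disallow: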
Alternative
Maybe you are fine with crawling but just want to prevent indexing? In that case you could use meta-robots (with a noindex value) instead of robots.txt. Supporting bots will still crawl the /cart page (and follow links, unless you also use nofollow), but they won't index it.
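A sketch of what that could look like in the head of the /cart page (the noindex, nofollow variant is only needed if you also want links on the page ignored):

    <meta name="robots" content="noindex">
    <!-- or, to also stop link following: -->
    <meta name="robots" content="noindex, nofollow">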