
On my website I have a page for the cart at http://www.example.com/cart, and another for the cartoons at http://www.example.com/cartoons. What should I write in my robots.txt file to block only the cart page?

The cart page's URL does not take a trailing slash, so if I write
Disallow: /cart, it will block /cartoons too.

I don’t know whether something like /cart$ is possible and would be parsed correctly by spider bots. I don’t want to rely on Allow: /cartoon, because there may be other pages with the same prefix.

2 Answers


  1. You could explicitly allow and disallow both paths. When rules conflict, the longer (more specific) rule takes precedence:

    Disallow: /cart
    Allow: /cartoon
    

    More info is available at: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
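
    That longest-match precedence can be illustrated with a small sketch. Note this is a hypothetical helper for illustration, not Google's actual parser; the assumption that a length tie goes to allow follows Google's documented "least restrictive rule" behaviour:

```python
# Hypothetical sketch of Google's documented robots.txt precedence:
# among all rules whose path is a prefix of the requested path, the
# longest one wins. Assumption: on a length tie, the less restrictive
# rule ("allow") wins, per Google's robots.txt documentation.
def most_specific_rule(rules, path):
    matching = [(directive, rule_path)
                for directive, rule_path in rules
                if path.startswith(rule_path)]
    if not matching:
        return ("allow", "")  # no rule matches: crawling is allowed
    return max(matching, key=lambda r: (len(r[1]), r[0] == "allow"))

rules = [("disallow", "/cart"), ("allow", "/cartoon")]
print(most_specific_rule(rules, "/cart"))     # ('disallow', '/cart')
print(most_specific_rule(rules, "/cartoon"))  # ('allow', '/cartoon')
```

    Here /cartoon matches both rules, but the allow rule's path is longer, so it wins and /cartoon stays crawlable.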

  2. In the original robots.txt specification, this is not possible. It supports neither Allow nor any characters with special meaning inside a Disallow value.

    But some consumers support additional things. For example, Google gives a special meaning to the $ sign, where it represents the end of the URL path:

    Disallow: /cart$
    

    For Google, this will block /cart, but not /cartoon.

    Consumers that don’t give this special meaning will interpret $ literally, so they will block /cart$, but not /cart or /cartoon.

    So if you use this, you should target the bots that support it with a specific User-agent line.
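
    You can observe the literal interpretation with Python's standard-library urllib.robotparser, which follows the original specification and gives $ no special meaning (a small demo of an original-spec consumer, not of Google's parser):

```python
from urllib import robotparser

# Python's urllib.robotparser follows the original specification and does
# simple prefix matching, so "$" is treated as a literal character.
rp = robotparser.RobotFileParser()
rp.modified()  # mark the rules as loaded so can_fetch() consults them
rp.parse([
    "User-agent: *",
    "Disallow: /cart$",
])

print(rp.can_fetch("*", "http://www.example.com/cart"))     # True  (not blocked)
print(rp.can_fetch("*", "http://www.example.com/cartoon"))  # True  (not blocked)
print(rp.can_fetch("*", "http://www.example.com/cart$"))    # False (literal match)
```

    In other words, for an original-spec consumer the rule only blocks a URL that literally ends in the $ character.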

    Alternative

    Maybe you are fine with crawling but just want to prevent indexing? In that case you could use a robots meta tag (with a noindex value) instead of robots.txt. Supporting bots will still crawl the /cart page (and follow its links, unless you also use nofollow), but they won’t index it.

    <!-- in the <head> of the /cart page -->
    <meta name="robots" content="noindex" />
    