
I am trying to block our job board from being crawled. Can a specific URL be blocked with "Disallow" in a robots.txt file, and what would that look like for this URL? I don’t want to disallow just HTML; I want to block only the URL jobs.example.com. Something like this:

Disallow: https://jobs.example.com/

2 Answers


  1. To disallow web crawlers from crawling one specific page, you can use the following lines:

    User-agent: *
    Disallow: /path/to/page/
    

    Or to disallow the entire website:

    User-agent: *
    Disallow: /
    

    Note that not all search engines/crawlers will respect that file.
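    If you want to sanity check rules like these, Python's standard library ships a robots.txt parser. A minimal sketch (the crawler name "MyCrawler" and the example URLs are placeholders):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /path/to/page/",
    ])

    # The disallowed path is refused; everything else is still crawlable.
    print(rp.can_fetch("MyCrawler", "https://example.com/path/to/page/"))  # False
    print(rp.can_fetch("MyCrawler", "https://example.com/other/"))         # True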

  2. You can’t put full URLs into robots.txt disallow rules. Your proposed rule WON’T WORK as written:

    # INCORRECT
    Disallow: https://jobs.example.com/
    

    It looks like you might be trying to disallow crawling on the jobs subdomain. That is possible, but each subdomain gets its own robots.txt file, so you would have to configure your server to serve different content at these two URLs:

    • https://example.com/robots.txt
    • https://jobs.example.com/robots.txt

    Then your jobs robots.txt should disallow all crawling on that subdomain:

    User-Agent: *
    Disallow: /
    
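    One way to serve different robots.txt content per hostname, sketched with Python's standard http.server (the hostnames and port here are assumptions; in practice you would do the equivalent in your web server or framework configuration):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical mapping from Host header to robots.txt body.
    ROBOTS = {
        "jobs.example.com": "User-Agent: *\nDisallow: /\n",  # block the whole jobs subdomain
        "example.com": "User-Agent: *\nDisallow:\n",         # allow crawling on the main site
    }

    class RobotsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/robots.txt":
                host = (self.headers.get("Host") or "").split(":")[0]
                body = ROBOTS.get(host, "User-Agent: *\nDisallow:\n").encode("utf-8")
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("", 8000), RobotsHandler).serve_forever()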

    If you are trying to disallow just the home page for that subdomain, you would have to use syntax that only the major search engines understand. You can use a $ for "ends with" and the major search engines will interpret it correctly:

    User-Agent: *
    Disallow: /$
    
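    To see what that anchor does, here is a deliberately simplified illustration of the "$" matching (not a full robots.txt matcher; real parsers also handle "*" wildcards and rule precedence):

    # Simplified sketch of how "$" is interpreted; illustration only.
    def rule_matches(rule: str, path: str) -> bool:
        if rule.endswith("$"):
            return path == rule[:-1]    # "$" anchors the rule to the end of the path
        return path.startswith(rule)    # otherwise a rule is a prefix match

    print(rule_matches("/$", "/"))          # True: the home page itself is blocked
    print(rule_matches("/$", "/openings"))  # False: deeper pages stay crawlable
    print(rule_matches("/", "/openings"))   # True: plain "/" would block everything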