
I am using Heroku Pipelines. When I push my application, it is deployed to the staging app:

https://appname.herokuapp.com/

and if everything is correct, I promote that app to production. There is no new build process; it is the same app that was built the first time for staging.

https://appname.com/

This causes a duplicate-content problem: the two sites are exact clones of each other. I would like to exclude the staging app from indexing by Google and other search engines.

One way I thought of was a robots.txt file.

For this to work, I think I should write it like this:

User-agent: *
Disallow: https://appname.herokuapp.com/

I am using the absolute URL because this file will be on the server in both the staging and production applications, and I only want to remove the staging app from Google's index without touching the production one.

Is this the right way to do it?

2 Answers


  1. No. What you’ve suggested targets all search engines and bots (User-agent: *), and Disallow cannot take a full URL such as https://appname.herokuapp.com/.

    Instead, what you should use is:

    User-agent: Googlebot
    Disallow: /
    

    Served from https://appname.herokuapp.com/, this blocks only Googlebot from crawling that host. Keep in mind that bots can ignore the robots.txt file; it is a request, not an enforcement mechanism. Google, however, does honor it.

    EDIT

    Following unor’s advice: it is not possible to Disallow by full URL, so I’ve removed that from my answer. You can, however, block particular paths, e.g. /appname/, or use / to stop Googlebot from accessing anything.
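
As a quick check of the point above, the following sketch uses Python's standard urllib.robotparser to compare the rules discussed here. The helper name is_allowed and the sample paths are illustrative assumptions, and the standard-library parser only approximates how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Rule from the question: a full URL as the Disallow value.
full_url_rule = "User-agent: *\nDisallow: https://appname.herokuapp.com/\n"

# Rule from this answer: "/" is a prefix of every path on the host serving it.
root_rule = "User-agent: Googlebot\nDisallow: /\n"

# Path-specific rule: only paths starting with /appname/ are affected.
path_rule = "User-agent: Googlebot\nDisallow: /appname/\n"

staging = "https://appname.herokuapp.com"

if __name__ == "__main__":
    print("full URL rule, /foo:     ", is_allowed(full_url_rule, staging + "/foo"))
    print("root rule, /foo:         ", is_allowed(root_rule, staging + "/foo"))
    print("path rule, /appname/page:", is_allowed(path_rule, staging + "/appname/page"))
    print("path rule, /other:       ", is_allowed(path_rule, staging + "/other"))
```

Run this way, the full-URL rule is expected to leave /foo allowed (it matches no real path), while Disallow: / blocks it and Disallow: /appname/ blocks only paths under /appname/.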

  2. No, the Disallow field can’t take full URL references. Your robots.txt would block URLs like these:

    • https://example.com/https://appname.herokuapp.com/
    • https://example.com/https://appname.herokuapp.com/foo

    The Disallow value always represents the beginning of the URL’s path.

    To block all URLs under https://appname.herokuapp.com/, you would need:

    Disallow: /
    

    So you have to use different robots.txt files for https://appname.herokuapp.com/ and https://appname.com/.

    If you don’t mind bots crawling https://appname.herokuapp.com/, you could use noindex instead, but this would also require different behaviour on the two sites. An alternative that doesn’t require different behaviour is the canonical link type, which tells crawlers which URL is preferred for indexing:

    <!-- on https://appname.herokuapp.com/foobar -->
    <link rel="canonical" href="https://appname.com/foobar" />
    
    <!-- on https://appname.com/foobar -->
    <link rel="canonical" href="https://appname.com/foobar" />
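
Since the promoted app is the same build on both hosts, one way to get the "different robots.txt files" or the noindex behaviour mentioned above is to decide per request, based on the Host header. The sketch below is framework-agnostic and only illustrative: STAGING_HOSTS, robots_txt_for and extra_headers_for are hypothetical names, and wiring them into a /robots.txt route or a response hook is left to whatever web framework the app uses.

```python
# Assumption: the staging hostname from the question; adjust to the real app.
STAGING_HOSTS = {"appname.herokuapp.com"}

BLOCK_ALL = "User-agent: *\nDisallow: /\n"   # ask all crawlers to stay out
ALLOW_ALL = "User-agent: *\nDisallow:\n"     # empty Disallow permits everything


def _hostname(host_header: str) -> str:
    """Lower-case the Host header value and drop any :port suffix."""
    return host_header.strip().lower().split(":", 1)[0]


def robots_txt_for(host_header: str) -> str:
    """Body to serve at /robots.txt for the host that received the request."""
    return BLOCK_ALL if _hostname(host_header) in STAGING_HOSTS else ALLOW_ALL


def extra_headers_for(host_header: str) -> dict[str, str]:
    """Alternative to blocking crawling: mark staging responses as noindex."""
    if _hostname(host_header) in STAGING_HOSTS:
        return {"X-Robots-Tag": "noindex"}
    return {}


if __name__ == "__main__":
    for host in ("appname.herokuapp.com", "appname.com"):
        print(host, repr(robots_txt_for(host)), extra_headers_for(host))
```

These two helpers are alternatives rather than a pair: a crawler that is blocked by robots.txt never fetches the page, so it cannot see a noindex header on it.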
    