I am using Heroku pipelines. When I push my application, it is deployed to the staging app:
https://appname.herokuapp.com/
If everything is correct, I promote that app to production. There is no new build process; it is the same app that was built the first time for staging:
https://appname.com/
The problem is that this causes duplicate content: the two sites are exact clones of each other. I would like to exclude the staging app from Google indexing and search engines in general.
One way I thought of was with a robots.txt file.
For this to work, I would write it like this:
User-agent: *
Disallow: https://appname.herokuapp.com/
using the absolute path, because this file will be on the server in both the staging and production applications, and I only want to remove the staging app from Google's index without touching the production one.
Is this the right way to do it?
2 Answers
No, using what you've suggested would block all search engines and bots from accessing
https://appname.herokuapp.com/
Instead, what you should use is:
User-agent: Googlebot
Disallow: /
This will only block Googlebot from accessing
https://appname.herokuapp.com/
Keep in mind, bots can ignore the robots.txt file; it is more of a request than anything, but Google will follow it.
EDIT
After seeing unor's answer: it is not possible to Disallow by full URL, so I've changed that in my answer. You can, however, block particular paths, e.g. /appname/, or use / to stop Googlebot from accessing anything.
No, the Disallow field can't take full URL references. Your robots.txt would block URLs like these:
https://example.com/https://appname.herokuapp.com/
https://example.com/https://appname.herokuapp.com/foo
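Python's built-in urllib.robotparser applies the same prefix-matching rules, so it can be used to verify that a full-URL Disallow value never matches any real path:

```python
import urllib.robotparser

# robots.txt as proposed in the question: Disallow with a full URL.
broken = urllib.robotparser.RobotFileParser()
broken.parse("User-agent: *\nDisallow: https://appname.herokuapp.com/".splitlines())

# The value is treated as a path prefix, so it matches nothing
# and the staging site remains fully crawlable.
print(broken.can_fetch("*", "https://appname.herokuapp.com/secret"))  # True

# A path-based rule, by contrast, blocks everything.
fixed = urllib.robotparser.RobotFileParser()
fixed.parse("User-agent: *\nDisallow: /".splitlines())
print(fixed.can_fetch("*", "https://appname.herokuapp.com/secret"))  # False
```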
The Disallow value always represents the beginning of the URL's path. To block all URLs under https://appname.herokuapp.com/, you would need:
Disallow: /
So you have to use different robots.txt files for https://appname.herokuapp.com/ and https://appname.com/.
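Since the pipeline promotes the very same build from staging to production, shipping two different static robots.txt files isn't really an option. One workaround is to generate robots.txt per request from the Host header. A minimal sketch, assuming a Python WSGI app (the question doesn't say which stack is in use) and the hostnames from the question:

```python
# One robots.txt handler for both sites: decide by hostname at request time.
STAGING_HOST = "appname.herokuapp.com"  # assumption: staging hostname from the question

BLOCK_ALL = "User-agent: *\nDisallow: /\n"   # served on staging
ALLOW_ALL = "User-agent: *\nDisallow:\n"     # served on production

def robots_txt(host: str) -> str:
    """Return the robots.txt body appropriate for the requesting host."""
    hostname = host.split(":")[0].lower()    # drop an optional port
    return BLOCK_ALL if hostname == STAGING_HOST else ALLOW_ALL

def app(environ, start_response):
    """Minimal WSGI app that answers /robots.txt per host."""
    if environ.get("PATH_INFO") == "/robots.txt":
        body = robots_txt(environ.get("HTTP_HOST", "")).encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

The same idea works in any framework: route /robots.txt to a handler and branch on the request's hostname.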
If you don't mind bots crawling https://appname.herokuapp.com/, you could make use of noindex instead. But this would also require different behaviour for both sites. An alternative that doesn't require different behaviour could be to make use of canonical, which conveys to crawlers which URL is preferred for indexing.
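For the noindex route, the header form (X-Robots-Tag) avoids touching templates. A sketch under the same assumptions as above (Python WSGI app, hostnames from the question); the canonical_tag helper is a hypothetical name for illustration:

```python
STAGING_HOST = "appname.herokuapp.com"  # assumption: staging hostname from the question

def noindex_on_staging(app):
    """WSGI middleware: mark responses served from staging as noindex.

    Google honours the X-Robots-Tag response header just like a
    <meta name="robots" content="noindex"> tag in the page.
    """
    def middleware(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            host = environ.get("HTTP_HOST", "").split(":")[0].lower()
            if host == STAGING_HOST:
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)
        return app(environ, patched_start)
    return middleware

def canonical_tag(path: str) -> str:
    """Hypothetical helper: canonical link pointing at the production URL."""
    return f'<link rel="canonical" href="https://appname.com{path}">'
```

Unlike the robots.txt approach, both of these let bots crawl staging; they only steer which URLs end up in the index.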