Let’s say you have a website hosted on example.org. That website has a single page whose content is static if the requesting client is not logged in, but dynamic (tailored to that client) if the client is logged in.
To handle this properly in terms of search-engine indexing, we are currently thinking of creating two separate files, say logged_out.html and logged_in.php. When the URL (e.g. example.org) is requested, the PHP code checks whether the current user is logged in; if so, it requires logged_in.php, otherwise logged_out.html.
The logged_in.php has this in its head:
<meta name="robots" content="noindex,nofollow,noarchive">
as it does not make sense to index pages with dynamic content in this setup.
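The same directives can also be delivered as an HTTP response header instead of a meta tag; a minimal sketch of what logged_in.php could send before any output:

<?php
// HTTP-header equivalent of the robots meta tag; must be emitted
// before any body output is sent.
header('X-Robots-Tag: noindex, nofollow, noarchive');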
My question thus basically is how to program the serving / routing of two pages, accessible under the same URL, with only one of them getting indexed, such that search engines see only one and completely ignore the other. Our current solution can be summed up like this:
<?php
// HTTP request incoming to https://example.org/sample, which is routed
// to the present file via server configs
if ($logged_in) {
    require("logged_in.php");   // should never be indexed / crawled
} else {
    require("logged_out.html"); // should be the indexed / crawled page
}
exit();
This should result in exclusively the contents of logged_out.html being indexed, and that under the URL https://example.org/sample, while the contents of logged_in.php should neither be indexed under any URL of example.org nor be crawled.
Does our approach yield that intended result?
2 Answers
Search engines are never logged in, so the robots meta tag on the logged_in.php page is not needed. They are also unaware of pages that are required by the main index.php they visit. So what will most likely happen is that index.php gets indexed with the content of logged_out.html. Just make sure to hide the real page to prevent duplicate content; keep it out of reach of the robots. This matches the code in your question.
More thoughts:
To truly hide a page from getting indexed, you should not expose its URL in links. Place it in a directory with access denied, or outside the web root; it can still be required from there, don’t worry (see the sketch below). Bonus: you can declare a duplicate page to be the canonical of another.
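A minimal sketch of that layout, with hypothetical paths and a hypothetical user_id session key (none of these names come from the question):

<?php
// Hypothetical layout:
//   /var/www/example.org/public/index.php      <- document root, this file
//   /var/www/example.org/private/logged_in.php <- has no URL, never crawlable
session_start();

if (isset($_SESSION['user_id'])) {  // hypothetical login check
    // require() works on filesystem paths, so a file outside the
    // web root can still be included.
    require __DIR__ . '/../private/logged_in.php';
} else {
    require __DIR__ . '/logged_out.html';
}
exit();

For the canonical bonus, the duplicate page would carry a <link rel="canonical" href="https://example.org/sample"> element in its head, pointing at the page that should be indexed.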
A page gets indexed only if a search engine can discover and crawl it.
One way is to serve the logged-in version under a URL with an extra query parameter. A page with different parameters can be regarded as a different page by the search engine. This works as long as there are no internal links to the parameterized page.
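A minimal sketch of that idea, using a hypothetical view=account parameter and a hypothetical user_id session key (both names are illustrative):

<?php
// Logged-in visitors are redirected to a parameterized URL that is
// never linked anywhere, so crawlers (which are never logged in)
// only ever see the plain, indexable URL.
session_start();
$logged_in = isset($_SESSION['user_id']);     // hypothetical session key

if ($logged_in && !isset($_GET['view'])) {
    header('Location: /sample?view=account'); // hypothetical parameter
    exit();
}

if ($logged_in && $_GET['view'] === 'account') {
    require 'logged_in.php';    // dynamic page, never linked publicly
} else {
    require 'logged_out.html';  // static page, gets indexed
}
exit();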
The other, better way is to use a session variable that your login code writes to when someone logs in. When a search engine requests the page, that session data is not there. Your code can store whatever data it needs in the session and retrieve it before rendering the personalized page.
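A minimal sketch of the session approach, again with a hypothetical user_id session key:

<?php
// A crawler sends no session cookie, so for it the session is empty
// and the static, indexable page is served.
session_start();

// Your login code would previously have done something like:
//   $_SESSION['user_id'] = $user->id;   // hypothetical key and value
if (isset($_SESSION['user_id'])) {
    require 'logged_in.php';    // may read $_SESSION for personalization
} else {
    require 'logged_out.html';
}
exit();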