Proposed solution for indexing of old RTD pages by search engines
Note: this is a complement to https://github.com/astropy/astropy/issues/7794, in which we discussed that we need a custom robots.txt to prevent indexing of old tags. This issue proposes an alternative solution.
Background
In https://github.com/astropy/astropy/pull/7909, @dasdachs made it so that new versions of the docs on RTD have a meta tag:
<meta name="robots" content="noindex, nofollow">
if the version is not latest nor stable. This is good and works for the v2.0.x, v3.0.x, stable, and latest branches. It will also work for any new tagged version we release in future.
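To make the rule concrete, here is a minimal sketch of the decision in Python. This is not the code from the pull request; the function name `robots_meta` and the set `INDEXED_VERSIONS` are made up for illustration.

```python
# Hypothetical helper illustrating the rule; not the actual code from the PR.
INDEXED_VERSIONS = {"latest", "stable"}
META_TAG = '<meta name="robots" content="noindex, nofollow">'


def robots_meta(version):
    """Return the robots meta tag for versions that should stay out of search results."""
    # Only latest and stable remain indexable; every other version gets the tag.
    return "" if version in INDEXED_VERSIONS else META_TAG
```

Under this rule a tag such as v0.2.1 would get the meta tag, provided it had been built with the new configuration, which is exactly what the old tags lack.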
However, it doesn’t solve the problem for all the old tagged versions we have. Most notably, v0.2.1 seems to always appear near the top in Google searches, which is frustrating. One solution would be to re-tag all the releases with an extra commit that adds the meta tag, but I don’t think we want to re-tag all the versions, so that’s not an option.
Suggested approach
The key to the solution here is that RTD supports redirect patterns. Here’s what I suggest we do:
Preparing the astropy repository
To start off, we make an archive branch in the astropy repository that is an empty branch (i.e. does not share history with master), and we create a docs/ folder, which then contains an archive/ folder. In this folder, we place a static version (html, not rst) of all the problematic tagged versions that exist. Note that we don’t need to build these, we can just web scrape them with e.g.:
wget -r -np docs.astropy.org/en/v0.2.2/
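Scripting this for every tag could look roughly like the sketch below. The `TAGS` list and the function name `wget_command` are placeholders for illustration; the real list of tags would come from the repository.

```python
import shlex

# Hypothetical list of tags to archive; replace with the real tag names.
TAGS = ["v0.2.1", "v0.2.2"]


def wget_command(tag, host="docs.astropy.org"):
    """Return the wget argument list that mirrors one tagged version."""
    # -r follows links recursively, -np never ascends to the parent directory.
    return ["wget", "-r", "-np", f"{host}/en/{tag}/"]


# Print the commands for review; pass each list to subprocess.run to execute it.
for tag in TAGS:
    print(shlex.join(wget_command(tag)))
```

Printing the commands first makes it easy to check them by eye before anything is downloaded.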
We then edit the Sphinx configuration to include:
html_extra_path = ['archive']
in it, which will cause any folder inside archive to get copied to the output _build/html directory.
Next up, we edit (with a script) all the html files inside archive/ to include the meta tag:
<meta name="robots" content="noindex, nofollow">
We then commit and publish the docs with RTD. Note that the build will be quite fast since it’s just copying static files.
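The script that inserts the meta tag could be as small as the following sketch. It assumes the scraped files are UTF-8 and that each page has a regular opening head tag; the function names `add_noindex` and `tag_archive` are made up for illustration.

```python
import re
from pathlib import Path

META_TAG = '<meta name="robots" content="noindex, nofollow">'
# Matches the opening <head> tag, with or without attributes, but not <header>.
HEAD_RE = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def add_noindex(html):
    """Return html with the robots meta tag inserted right after the opening <head>."""
    if META_TAG in html:
        return html  # already tagged, so re-running the script is safe
    return HEAD_RE.sub(lambda m: m.group(0) + "\n" + META_TAG, html, count=1)


def tag_archive(root):
    """Rewrite every .html file under root in place and return how many files changed."""
    changed = 0
    for path in Path(root).rglob("*.html"):
        original = path.read_text(encoding="utf-8")  # assumes UTF-8 pages
        updated = add_noindex(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```

Calling `tag_archive("docs/archive")` would then process all archived versions in one go, and pages without a head tag are left untouched.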
Now at this point you’re probably thinking Jeez Tom, isn’t that a lot of files to include in a branch of the main repo? Each version is several tens of MB, so isn’t that like a GB in total? I’m glad you asked - in fact I skipped over some details above. The archive folder can actually be a submodule that points to another repository! Let’s look at a real test case. I’ve made the following repository:
https://github.com/astropy/archived-documentation
which contains the static files. I’ve then made a repo with the sphinx configuration as described above, and archive is a submodule:
https://github.com/astrofrog/test-rtd/tree/archive/docs (this would be the archive branch of the astropy repo)
How do you know this works? I hear you ask. 🤔
Well, take a look at:
https://astrofrog-test-rtd.readthedocs.io/en/v0.2.1/
And notice the redirect!
ReadTheDocs configuration
Next up, we edit the RTD configuration. We set up a redirect for each tagged version, so that /en/v0.2.1/<anything> redirects to /en/archive/v0.2.1/<anything>.
Once the redirects are in place for all tagged versions, we remove the existing builds of the tags from RTD. And that’s it!
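Since each redirect has to be entered individually, it may help to generate the list of from/to prefixes up front. The sketch below only builds the URL pairs and does not talk to RTD; the `TAGS` list and the function name `redirect_pair` are placeholders.

```python
def redirect_pair(tag, archive_version="archive"):
    """Return the (from, to) URL prefixes for one tagged version."""
    return (f"/en/{tag}/", f"/en/{archive_version}/{tag}/")


# Hypothetical list; it should contain every tag that was archived.
TAGS = ["v0.2.1", "v0.2.2"]

for tag in TAGS:
    source, target = redirect_pair(tag)
    print(f"{source} -> {target}")
```

The printed list can double as a checklist while setting up the redirects and removing the old builds.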
Frequently asked questions
Isn’t that going to be a lot of work to do?
Not really - we can script the wget job above for each tag, and then script the insertion of the meta tags. The longest part will be setting up the redirects for a few dozen tagged versions (each one has to be done individually) but that will take ~10 min at most?
Isn’t that going to be a lot of work to maintain?
Nope! This is a one-off affair, as new tagged versions will include the meta tag by default, so won’t need this approach. This is purely for existing old tagged versions at this point in time.
But won’t this mean the versions won’t show up in the list of versions in the pop up menu when visiting docs.astropy.org?
It does, but that might not be a bad thing. Have you seen what that thing looks like at the moment??
Instead of the old tagged versions, there will be an ‘archive’ version that users can click on. We could populate the index.rst file of the archive branch to include links to all the archived versions and explain these are all old versions and are kept for reproducibility.
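The index.rst of the archive branch could itself be generated from the list of tags, for example with the sketch below. It assumes each archived version ends up in a folder named after its tag next to the built index page (which is what html_extra_path should give us) and that each folder has an index.html; the function name `archive_index` and the wording of the page are made up.

```python
def archive_index(tags, title="Archived documentation"):
    """Return reStructuredText for a page linking to each archived version."""
    lines = [
        title,
        "=" * len(title),  # RST title underline, same length as the title
        "",
        "These are old versions of the documentation, kept for reproducibility.",
        "",
    ]
    for tag in tags:
        # Relative link to the static copy of that version.
        lines.append(f"* `{tag} <{tag}/index.html>`_")
    return "\n".join(lines) + "\n"


print(archive_index(["v0.2.1", "v0.2.2"]))
```

The output would be written to docs/index.rst on the archive branch.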
Are there other benefits to this?
I’m glad you asked! One benefit is that if we ever need to make other kinds of changes in future to old published versions, this is a good mechanism to do so. Also we don’t have to wait for RTD to implement support for custom robots.txt.
Thoughts/suggestions welcome!
I’m happy to do the work if we decide it’s a good idea. Note that we can also first test with a single tagged version if people would be more comfortable with that.
Issue Analytics
- Created 5 years ago
- Comments: 30 (30 by maintainers)
Top GitHub Comments
@bsipocz - no, I still have to set up the redirects, but I didn’t want to do that until I got explicit approval that we can move ahead with this. This is the point where things aren’t reversible, because I’m going to have to wipe the builds for the existing tags on RTD for the redirects to work.
So just to be explicit, @eteq and @bsipocz - shall I go ahead with this?
And the RTD build: http://docs.astropy.org/en/older-docs-archive/