
Proposed solution for indexing of old RTD pages by search engines

Note: this is a complement to https://github.com/astropy/astropy/issues/7794, in which we discussed the need for a custom robots.txt to prevent indexing of old tags. This issue proposes an alternative solution.

Background

In https://github.com/astropy/astropy/pull/7909, @dasdachs made it so that new versions of the docs on RTD have a meta tag:

<meta name="robots" content="noindex, nofollow">

if the version is neither latest nor stable. This is good and works for the v2.0.x, v3.0.x, stable, and latest branches. It will also work for any new tagged version we release in future.
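For context, a common way to wire this kind of thing up (a sketch under assumptions, not necessarily what #7909 actually does) is to key off the environment variables that RTD sets during builds:

# conf.py -- sketch; RTD sets READTHEDOCS and READTHEDOCS_VERSION
import os

on_rtd = os.environ.get('READTHEDOCS') == 'True'
rtd_version = os.environ.get('READTHEDOCS_VERSION')
if on_rtd and rtd_version not in ('latest', 'stable'):
    # Flag picked up by a small layout.html template override that
    # emits the robots meta tag inside <head>.
    html_context = {'noindex': True}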

However, it doesn't solve the problem for all the old tagged versions we have. Most notably, v0.2.1 seems to always appear near the top of Google searches, which is frustrating. One solution would be to re-tag all the releases with a commit that adds the meta tag, but I don't think we want to re-tag released versions, so that's not really an option.

Suggested approach

The key to the solution here is that RTD supports redirect patterns. Here’s what I suggest we do:

Preparing the astropy repository

To start off, we make an archive branch in the astropy repository that is an empty branch (i.e. one that does not share history with master), and in it we create a docs/ folder, which then contains an archive/ folder. In this folder, we place a static version (HTML, not RST) of each of the problematic tagged versions. Note that we don't need to build these; we can just web scrape them with e.g.

wget -r -np docs.astropy.org/en/v0.2.2/
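This can be scripted over all the tags; a minimal sketch (the tag list is illustrative, and the exact wget flags may need tweaking):

# scrape_tags.py -- sketch: mirror each old tagged version into archive/
import subprocess

TAGS = ['v0.2.1', 'v0.2.2', 'v0.3']  # ...and so on for every old tag

for tag in TAGS:
    # -r: recurse; -np: never ascend to the parent directory;
    # -nH: drop the hostname from local paths; --cut-dirs=1: strip the
    # leading 'en/' component; -P: write everything under archive/
    subprocess.run([
        'wget', '-r', '-np', '-nH', '--cut-dirs=1', '-P', 'archive',
        f'https://docs.astropy.org/en/{tag}/',
    ])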

We then edit the Sphinx configuration to include:

html_extra_path = ['archive']

in it, which will cause any folder inside archive to get copied to the output _build/html directory.
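The Sphinx setup on that branch can stay minimal; roughly this (the project metadata here is illustrative):

# docs/conf.py on the archive branch -- minimal sketch
project = 'Astropy archived documentation'
master_doc = 'index'
# Everything inside archive/ (v0.2.1/, v0.2.2/, ...) is copied
# verbatim into the root of the built HTML output.
html_extra_path = ['archive']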

Next up, we edit (with a script) all the HTML files inside archive/ to include the meta tag:

<meta name="robots" content="noindex, nofollow">
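This step is easy to script too; a sketch, assuming each page has a plain <head> tag to anchor on:

# add_noindex.py -- sketch: inject the robots meta tag into every
# archived HTML page
from pathlib import Path

TAG = '<meta name="robots" content="noindex, nofollow">'

for page in Path('archive').rglob('*.html'):
    html = page.read_text(encoding='utf-8', errors='ignore')
    if TAG not in html and '<head>' in html:
        page.write_text(html.replace('<head>', '<head>\n' + TAG, 1),
                        encoding='utf-8')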

We then commit and publish the docs with RTD. Note that the build will be quite fast since it’s just copying static files.

Now at this point you're probably thinking Jeez Tom, isn't that a lot of files to include in a branch of the main repo? Each version is several tens of MB, so isn't that like a GB in total? I'm glad you asked - in fact, I skipped over some details above. The archive folder can actually be a submodule that points to another repository! Let's look at a real test case. I've made the following repository:

https://github.com/astropy/archived-documentation

which contains the static files. I’ve then made a repo with the sphinx configuration as described above, and archive is a submodule:

https://github.com/astrofrog/test-rtd/tree/archive/docs (this would be the archive branch of the astropy repo)
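For completeness, the branch and submodule wiring looks roughly like this (a sketch; the exact steps may differ):

git checkout --orphan archive   # new branch with no shared history
git rm -rf .                    # start from an empty tree
mkdir docs                      # the Sphinx config goes in here
git submodule add https://github.com/astropy/archived-documentation docs/archive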

How do you know this works? I hear you ask. 🤔

Well, take a look at:

https://astrofrog-test-rtd.readthedocs.io/en/v0.2.1/

And notice the redirect!

ReadTheDocs configuration

Next up, we edit the RTD configuration. We set up a redirect for each tagged version that looks like this:

[Screenshot: the RTD redirect configuration form, 2018-10-25]

As you can guess, this causes /en/v0.2.1/<anything> to redirect to /en/archive/v0.2.1/<anything>
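Concretely, each rule would be an RTD 'Exact Redirect' of roughly this form (set up once per tag; the exact field names depend on the RTD admin UI):

Type:     Exact Redirect
From URL: /en/v0.2.1/$rest
To URL:   /en/archive/v0.2.1/$rest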

Once the redirects are in place for all tagged versions, we remove the existing builds of the tags from RTD. And that’s it!

Frequently asked questions

Isn’t that going to be a lot of work to do?

Not really - we can script the wget job above for each tag, and then script the insertion of the meta tags. The longest part will be setting up the redirects for a few dozen tagged versions (each one has to be done individually), but even that should take ~10 minutes at most.

Isn’t that going to be a lot of work to maintain?

Nope! This is a one-off affair, as new tagged versions will include the meta tag by default, so won’t need this approach. This is purely for existing old tagged versions at this point in time.

But won’t this mean the versions won’t show up in the list of versions in the pop up menu when visiting docs.astropy.org?

It does, but that might not be a bad thing. Have you seen what that thing looks like at the moment??

[Screenshot: the current version pop-up menu on docs.astropy.org, 2018-10-25]

Instead of the old tagged versions, there will be a single 'archive' version that users can click on. We could populate the index.rst file of the archive branch with links to all the archived versions, and explain that these are old versions kept for reproducibility.
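That index page could be as simple as the following sketch (the version list is illustrative):

Archived versions
=================

These are old, frozen versions of the documentation, kept for
reproducibility; they are no longer updated.

* `v0.2.1 <v0.2.1/index.html>`_
* `v0.2.2 <v0.2.2/index.html>`_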

Are there other benefits to this?

I'm glad you asked! One benefit is that if we ever need to make other kinds of changes to old published versions in future, this gives us a mechanism to do so. Also, we don't have to wait for RTD to implement support for a custom robots.txt.

Thoughts/suggestions welcome!

I’m happy to do the work if we decide it’s a good idea. Note that we can also first test with a single tagged version if people would be more comfortable with that.

Issue Analytics

  • State: closed
  • Created: October 2018
  • Comments: 30 (30 by maintainers)

Top GitHub Comments

astrofrog commented, Nov 14, 2018 (1 reaction)

@bsipocz - no, I still have to set up the redirects, but I didn’t want to do that until I got explicit approval that we can move ahead with this. This is the point where things aren’t reversible, because I’m going to have to wipe the builds for the existing tags on RTD for the redirects to work.

So just to be explicit, @eteq and @bsipocz - shall I go ahead with this?

