
False negatives in robots.txt processing?

See original GitHub issue

https://www.idealista.it has a robots.txt which appears complex but essentially contains the following:

  • User-agent: *
  • Allow: /en/geo/

Scrapy (1.6.0) keeps telling me that wherever I go on this site I’m Forbidden by robots.txt:

2019-02-23T11:06:44.226Z scrapy.downloadermiddlewares.robotstxt DEBUG # Forbidden by robots.txt: <GET https://www.idealista.it/en/geo/vendita-case/molise/>

I’m confused. I don’t think I should be blocked, and I suspect that Scrapy may be thrown off by other instructions in the robots.txt file.

I’m no expert by any means, but when I validate an apparently legitimate URL (https://www.idealista.it/en/geo/vendita-case/molise/) using an independent tool like http://tools.seobook.com/robots-txt/analyzer/ (and I’ve tried more than one to gain confidence), I’m told…

Url: https://www.idealista.it/en/geo/vendita-case/molise/
Multiple robot rules found 
Robots allowed: All robots

So, is the robots.txt analysis in Scrapy broken?

Scrapy tells me that everywhere on this site is blocked by the robots.txt. Just looking at the file myself, even without fully understanding the order of precedence, that just doesn’t seem right.

  1. If the answer is “Scrapy is correct” then why does it conflict with other analysers?
  2. Is there more I need to configure in Scrapy?
  3. Is there some middleware I’m missing?
  4. And, most importantly, how do I continue to use Scrapy now and analyse sites like this (see the sketch after this list)? Suggestions I don’t want are: circumvent robots.txt by setting ROBOTSTXT_OBEY = False, or write your own robots.txt analyser.
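
For reference, Scrapy 1.6 appears to delegate robots.txt checks to Python’s standard-library parser, so its verdict can be reproduced outside of a crawl. A minimal sketch, using only the two simplified rules quoted above; the parser choice and the assumption that the full file contains further rules are mine, not confirmed in the issue:

# Minimal sketch: query Python's standard-library robots.txt parser directly,
# which (to the best of my knowledge) is what Scrapy 1.6 uses under the hood.
from urllib import robotparser

rules = """\
User-agent: *
Allow: /en/geo/
"""  # simplified rules quoted above; the real robots.txt has many more lines

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# With only these two rules nothing disallows the URL, so this prints True.
# Feeding the site's full robots.txt through the same check shows which of
# the other rules flips the answer to False.
print(rp.can_fetch("mybot", "https://www.idealista.it/en/geo/vendita-case/molise/"))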

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
malberts commented, Feb 26, 2019

I forgot to mention that partial wildcards (Disallow:/*?ordine=stato-asc) will also not trigger for the same reason as $ (because of startswith()). Complete wildcards (Allow: *) will be fine because the parser checks that explicitly, but that’s usually only to override the default rules for a specific bot.
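
A quick illustration of that startswith() behaviour, assuming the standard-library parser that Scrapy 1.6 relies on: the wildcard pattern is treated as a literal prefix, so a URL that obviously matches it is still reported as allowed.

from urllib import robotparser

# Hypothetical file containing only the partial-wildcard rule mentioned above.
rules = """\
User-agent: *
Disallow: /*?ordine=stato-asc
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The URL matches the wildcard, but the parser only does a literal prefix
# comparison, so the Disallow rule never fires and the URL comes back allowed.
print(rp.can_fetch("mybot", "https://www.idealista.it/en/geo/vendita-case/molise/?ordine=stato-asc"))  # True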

Does this, according to the standard, make any sense at all…?

Disallow: /en/node/
Disallow: /en/

By itself, no. The latter would include the former. For all we know there might’ve been some historical reason, and they never bothered to remove the unnecessary entries since the big search engines don’t have an issue with it.

And this…

Disallow: /en/
Allow: /en/geo/

This is fine with a smarter parser. So by default disallow all English pages, except for those on the geo sub-path. That would be shorthand for explicitly Disallowing every non-/en/geo path.
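
To make the “smarter parser” point concrete, here is a sketch contrasting the standard-library parser (literal prefix match, first matching rule in file order wins) with a longest-match parser. The third-party protego package is used as an example of the latter; that choice is an assumption for illustration, not something from the original thread, although newer Scrapy releases do use it for robots.txt parsing as far as I know.

from urllib import robotparser

from protego import Protego  # pip install protego; a longest-match parser

rules = """\
User-agent: *
Disallow: /en/
Allow: /en/geo/
"""
url = "https://www.idealista.it/en/geo/vendita-case/molise/"

# Standard-library parser: rules are tried in file order with a prefix test,
# so "Disallow: /en/" matches first and the URL is reported as blocked.
rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
print(rp.can_fetch("mybot", url))                    # False

# Longest-match parser: "Allow: /en/geo/" is more specific than
# "Disallow: /en/", so it takes precedence and the URL is allowed.
print(Protego.parse(rules).can_fetch(url, "mybot"))  # True

The second result matches what the external checkers report, while the first matches Scrapy’s “Forbidden by robots.txt” message.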

I suppose, to be really “safe”, I should contact the website author for clarification on the intention of such a rule.

That’s a good idea - sometimes a robots.txt might technically allow you, but their T&Cs don’t. However, if you can get them to officially support your bot with an explicit User-Agent: ABCspider and simpler rules, then that’ll get you around the messy rules that apply to everyone else. Or they can just rearrange it so that a stricter parser understands it correctly.

1 reaction
malberts commented, Mar 22, 2019

@maramsumanth Yes, they are practically the same. In Scrapy it will never match the rule with $ anyway. In a smarter parser it also doesn’t make a difference, because both rules are there.
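
For completeness, a tiny sketch of why a $-terminated rule never matches in the standard-library parser: the $ is percent-encoded and kept as part of the literal prefix. The rule below is a made-up example, not one taken from the real robots.txt.

from urllib import robotparser

# Hypothetical rule with an end-of-URL anchor; the path is illustrative only.
rules = """\
User-agent: *
Disallow: /en/vendita-case$
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The "$" becomes part of the literal prefix ("/en/vendita-case%24"), which no
# real URL starts with, so the rule never fires.
print(rp.can_fetch("mybot", "https://www.idealista.it/en/vendita-case"))  # True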

Read more comments on GitHub >

Top Results From Across the Web

  • Robots.txt Introduction and Guide | Google Search Central
    Robots.txt is used to manage crawler traffic. Explore this robots.txt introduction guide to learn what robot.txt files are and how to use them…
  • 6 Common Robots.txt Issues & How To Fix Them
    Discover the most common robots.txt issues, the impact they can have on your website and your search presence, and how to fix them…
  • Analyzing One Million robots.txt Files - Intoli
    Insights gathered from analyzing the robots.txt files of Alexa’s top … filter less false negatives in exchange for more false positives…
  • What Is Robots.txt File and How to Configure It Correctly
    If your website has no robots.txt file, your website will be crawled entirely. It means that all website pages will get into…
  • Robots.txt checker: is your robots.txt set up correctly?
    Without any context, a robots.txt checker can only check whether you have any syntax mistakes or whether you’re using deprecated directives such…
