False negatives in robots.txt processing?
https://www.idealista.it has a robots.txt which appears complex but essentially has the following:
User-agent: *
Allow: /en/geo/
Scrapy (1.6.0) keeps telling me that wherever I go on this site I'm Forbidden by robots.txt:
2019-02-23T11:06:44.226Z scrapy.downloadermiddlewares.robotstxt DEBUG: Forbidden by robots.txt: <GET https://www.idealista.it/en/geo/vendita-case/molise/>
I'm confused. I don't think I should be blocked, and I suspect that Scrapy may be thrown off by other instructions in the robots.txt file.
I'm no expert by any means, but when I validate an apparently legitimate URL (https://www.idealista.it/en/geo/vendita-case/molise/) using an independent tool like http://tools.seobook.com/robots-txt/analyzer/ (and I've tried more than one to gain confidence), I'm told:
Url: https://www.idealista.it/en/geo/vendita-case/molise/
Multiple robot rules found
Robots allowed: All robots
So, is the robots.txt analysis in Scrapy broken?
Scrapy tells me that everything on this site is blocked by the robots.txt. Just looking at the file myself, even without fully understanding the order of precedence, that doesn't seem right.
- If the answer is “Scrapy is correct” then why does it conflict with other analysers?
- Is there more I need to configure in Scrapy?
- Is there some middleware I'm missing?
- And, most importantly, how do I continue to use Scrapy now to analyse sites like this? Suggestions I don't want are "circumvent robots.txt by setting ROBOTSTXT_OBEY = False" or "write your own robots.txt analyser".
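For reference, Scrapy 1.6's RobotsTxtMiddleware hands the file to Python's built-in urllib.robotparser, which takes the first rule that matches in file order and compares paths with plain startswith() prefixes, whereas most online analysers follow the longest-match convention. The sketch below shows how that difference can block /en/geo/ URLs; the Disallow: /en/ line is a hypothetical simplification for illustration, not a verbatim copy of idealista.it's rules.

```python
# Sketch (illustrative rules, not the real idealista.it file): compare
# Python's built-in parser, which Scrapy 1.6 uses, against a
# longest-match parser such as Protego.
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: *
Disallow: /en/
Allow: /en/geo/
"""

URL = "https://www.idealista.it/en/geo/vendita-case/molise/"

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

# First matching rule wins and matching is str.startswith(), so the broad
# Disallow: /en/ shadows the more specific Allow: /en/geo/.
print(rp.can_fetch("*", URL))  # -> False (the "Forbidden by robots.txt" case)

# A longest-match parser allows the same URL, like the online analysers do.
try:
    from protego import Protego
    print(Protego.parse(ROBOTS).can_fetch(URL, "*"))  # -> True
except ImportError:
    pass  # protego is an optional third-party package
```

In other words, both answers are consistent with their respective matching rules; the disagreement is in how Allow/Disallow precedence is resolved.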
Top GitHub Comments
I forgot to mention that partial wildcards (Disallow: /*?ordine=stato-asc) will also not trigger, for the same reason as $ (because of startswith()). Complete wildcards (Allow: *) will be fine because the parser checks that explicitly, but that's usually only to override the default rules for a specific bot.

By itself, no. The latter would include the former. For all we know there might've been some historical reasons and they never bothered to remove unnecessary entries since the big search engines don't have an issue.
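A minimal sketch of the startswith() point above, again with Python's built-in parser; the rules and URLs here are illustrative, not quoted verbatim from the site's robots.txt.

```python
# Sketch: wildcard and $-anchored rules never trigger in urllib.robotparser,
# because rule paths are percent-encoded and compared as literal prefixes
# with str.startswith(). The rules and URLs below are made up for illustration.
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: *
Disallow: /*?ordine=stato-asc
Disallow: /en/mappa$
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

# A wildcard-aware parser would block this URL; here the partial-wildcard
# rule is stored as the literal prefix "/%2A%3Fordine%3Dstato-asc" and so
# never matches anything.
print(rp.can_fetch("*", "https://www.idealista.it/case/?ordine=stato-asc"))  # -> True

# Likewise the $-anchored rule is stored literally ("%24") and never matches,
# not even the exact path it is meant to anchor.
print(rp.can_fetch("*", "https://www.idealista.it/en/mappa"))  # -> True
```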
This is fine with a smarter parser. So by default, disallow all English pages, except for those on the geo sub-path. That would be shorthand for explicitly Disallowing every non-/en/geo path.

That's a good idea - sometimes a robots.txt might technically allow you, but their T&Cs don't. However, if you can get them to officially support your bot with an explicit User-Agent: ABCspider and simpler rules, then that'll get you around the messy rules that apply to everyone else. Or they can just rearrange it so that a stricter parser understands it correctly.

@maramsumanth Yes, they are practically the same. In Scrapy it will never match the rule with $ anyway. In a smarter parser it also doesn't make a difference, because both rules are there.
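For completeness, later Scrapy releases expose a ROBOTSTXT_PARSER setting, so the "smarter parser" route does not require leaving Scrapy; the setting is not available in the 1.6.0 used in this report, so an upgrade is assumed. A settings.py sketch:

```python
# settings.py sketch - requires a Scrapy release that has ROBOTSTXT_PARSER
# (added after 1.6) and the protego package available.
ROBOTSTXT_OBEY = True
ROBOTSTXT_PARSER = "scrapy.robotstxt.ProtegoRobotParser"
```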