Wikipedia robots.txt raises exceptions
I'm scraping a page which in turn links to Wikipedia, but Wikipedia's robots.txt triggers the errors/exceptions below.
Python 2.7.12, Scrapy 1.2.1
2016-11-02 13:13:18 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/robots.txt> (referer: None)
2016-11-02 13:13:18 [py.warnings] WARNING: C:\Python27\lib\urllib.py:1303: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
return ''.join(map(quoter, s))
2016-11-02 13:13:18 [scrapy] ERROR: Error downloading <GET http://en.wikipedia.org/robots.txt>: u'\xd8'
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 587, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Python27\lib\site-packages\scrapy\downloadermiddlewares\robotstxt.py", line 97, in _parse_robots
rp.parse(body.splitlines())
File "C:\Python27\lib\robotparser.py", line 120, in parse
entry.rulelines.append(RuleLine(line[1], False))
File "C:\Python27\lib\robotparser.py", line 174, in __init__
self.path = urllib.quote(path)
File "C:\Python27\lib\urllib.py", line 1303, in quote
return ''.join(map(quoter, s))
KeyError: u'\xd8'
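For context, the crash is reproducible outside Scrapy. The middleware hands the decoded (unicode) body to the stdlib parser, and Python 2's urllib.quote() maps each character through a dict keyed by single-byte strings, so a non-ASCII unicode character such as u'\xd8' (Ø) first triggers the UnicodeWarning (the str/unicode comparison during the dict lookup) and then fails with the KeyError. A minimal reproduction sketch, not taken from the issue itself (assuming a Python 2.7 interpreter):

import robotparser

# Scrapy's _parse_robots() passes unicode lines; a non-ASCII character in a
# Disallow path reaches urllib.quote() inside RuleLine and raises the same
# KeyError: u'\xd8' seen in the traceback above.
rp = robotparser.RobotFileParser()
rp.parse([u'User-agent: *', u'Disallow: /wiki/\xd8'])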
Top GitHub Comments
This line [in robotparser.py's parse()] seems to be the issue. Changing it to

line[1] = w3lib.url.safe_url_string(line[1].strip())

seems to fix it.

Oh right, reading some recent PRs. Then it may be easier to use a custom RobotFileParser subclass for now.
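A sketch of that workaround (the class name and the exact approach are mine, not from the issue): since Python 2's urllib.quote() only blows up on unicode input and handles plain byte strings fine, re-encoding each line to UTF-8 bytes before the stock parser sees it sidesteps the KeyError without touching w3lib. Note that robotparser.RobotFileParser is an old-style class, so the parent method is called explicitly rather than via super().

import robotparser


class SafeRobotFileParser(robotparser.RobotFileParser):
    """Hypothetical subclass that tolerates non-ASCII rule lines."""

    def parse(self, lines):
        # urllib.quote() inside RuleLine fails only on unicode input;
        # UTF-8 byte strings pass through it cleanly (its lookup table
        # is keyed by bytes), so encode each line before delegating.
        byte_lines = [
            line.encode('utf-8') if isinstance(line, unicode) else line
            for line in lines
        ]
        robotparser.RobotFileParser.parse(self, byte_lines)

Wiring this into Scrapy 1.2 would also mean overriding RobotsTxtMiddleware._parse_robots, since it instantiates robotparser.RobotFileParser directly; that is only needed as a stopgap until the fix referenced in the comments above lands upstream.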