Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Wikipedia robots.txt raises exceptions

See original GitHub issue

I’m scraping a page which in turn links to Wikipedia.

But Wikipedia’s robots.txt is causing the errors/exceptions below.

Python 2.7.12, Scrapy 1.2.1

2016-11-02 13:13:18 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/robots.txt> (referer: None)
2016-11-02 13:13:18 [py.warnings] WARNING: C:\Python27\lib\urllib.py:1303: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return ''.join(map(quoter, s))

2016-11-02 13:13:18 [scrapy] ERROR: Error downloading <GET http://en.wikipedia.org/robots.txt>: u'\xd8'
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 587, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Python27\lib\site-packages\scrapy\downloadermiddlewares\robotstxt.py", line 97, in _parse_robots
    rp.parse(body.splitlines())
  File "C:\Python27\lib\robotparser.py", line 120, in parse
    entry.rulelines.append(RuleLine(line[1], False))
  File "C:\Python27\lib\robotparser.py", line 174, in __init__
    self.path = urllib.quote(path)
  File "C:\Python27\lib\urllib.py", line 1303, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xd8'
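The last two frames point at the root cause: in Python 2, urllib.quote() looks each character up in a dict keyed by byte strings, so a unicode argument containing a non-ASCII character such as u'\xd8' first trips the UnicodeWarning about the mixed str/unicode comparison and then misses the lookup entirely, raising KeyError. A minimal reproduction, independent of Scrapy (the path is illustrative, not taken from the actual Wikipedia robots.txt):

# -*- coding: utf-8 -*-
# Minimal Python 2.7 reproduction of the failure in the traceback above.
# urllib.quote() maps characters through a dict keyed by byte strings,
# so a unicode string with a non-ASCII character misses the lookup.
import urllib

print(urllib.quote('/wiki/\xd8'))   # byte string: works, prints '/wiki/%D8'
urllib.quote(u'/wiki/\xd8')         # unicode: raises KeyError: u'\xd8'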

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments:7 (7 by maintainers)

Top GitHub Comments

2 reactions
redapple commented on Nov 2, 2016

This line seems to be the issue. Changing to line[1] = w3lib.url.safe_url_string(line[1].strip()) seems to fix it.
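For reference, a sketch of what that change buys (the path value is an assumption for illustration): w3lib.url.safe_url_string() UTF-8-encodes and percent-escapes non-ASCII characters, so the stdlib parser only ever quotes plain ASCII:

# -*- coding: utf-8 -*-
# Sketch of the suggested fix: sanitize a rule path with w3lib before it
# reaches urllib.quote(). The example path is illustrative, not taken
# from the actual Wikipedia robots.txt.
from w3lib.url import safe_url_string

path = u'/wiki/\xd8'
print(safe_url_string(path.strip()))  # '/wiki/%C3%98', plain ASCII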

0 reactions
redapple commented on Nov 2, 2016

Oh right, reading some recent PRs. Then it may be easier to use a custom RobotFileParser subclass for now.
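A minimal sketch of that workaround, assuming the robots.txt body arrives as unicode (the class name SafeRobotFileParser is hypothetical): re-encode each line to UTF-8 bytes before handing it to the base class, so urllib.quote() only ever sees byte strings:

# -*- coding: utf-8 -*-
# Hypothetical workaround (Python 2): subclass the stdlib parser and
# re-encode unicode lines to UTF-8 byte strings, so that RuleLine's
# internal urllib.quote() call never receives raw unicode.
import robotparser

class SafeRobotFileParser(robotparser.RobotFileParser):
    def parse(self, lines):
        encoded = [line.encode('utf-8') if isinstance(line, unicode) else line
                   for line in lines]
        robotparser.RobotFileParser.parse(self, encoded)

As the _parse_robots frame in the traceback shows, Scrapy's RobotsTxtMiddleware builds the stdlib parser itself, so plugging this class in would also mean overriding that middleware.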

Read more comments on GitHub >

Top Results From Across the Web

Robots exclusion standard - Wikipedia
The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to indicate to...
Read more >
T12648 Exclusion to robots.txt file for en.wikipedia.org
If at all possible, could that page be added to the robots.txt file so that random searching won't turn it up? Thanks! --SatyrTN ......
Read more >
robots.txt - MediaWiki
robots.txt for http://www.wikipedia.org/ and friends # # Please note: ... There is a special exception for API mobileview to allow dynamic # mobile...
Read more >
Even good bots fight: The case of Wikipedia | PLOS ONE
In recent years, there has been a huge increase in the number of bots online, varying from Web crawlers for search engines, to...
Read more >
Bots - Official TF2 Wiki
Also, killing a bot with a Strange weapon will not increase the weapon's kill count, unless the weapon has a Strange Part: Robots...
Read more >
