
canonicalize at LinkExtractor works incorrectly

See original GitHub issue

First, see https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/lxmlhtml.py#L110, and then https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/lxmlhtml.py#L40.

So when canonicalize = False (the current default value), it still runs canonicalize_url, but it shouldn't, given the documented behavior:

:param canonicalize: canonicalize each extracted url (using w3lib.url.canonicalize_url)
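For context, canonicalization rewrites a URL into a standard form. Below is a rough stdlib approximation of one thing w3lib.url.canonicalize_url does (percent-encoding a non-ASCII path); the real function also sorts query arguments, normalizes case, strips fragments, and more, and the domain below is made up:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def rough_canonicalize(url, encoding="utf-8"):
    # Simplified stand-in for w3lib.url.canonicalize_url: percent-encode
    # the path using the given encoding; safe="/%" leaves path separators
    # and already-escaped sequences alone.
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%", encoding=encoding)
    return urlunsplit(parts._replace(path=path))

print(rough_canonicalize("http://dota2.example.com/url/资讯.url"))
# http://dota2.example.com/url/%E8%B5%84%E8%AE%AF.url
```

Note that this step has to decode the raw URL bytes with some encoding before it can re-encode them, which is why a wrong encoding assumption surfaces here rather than elsewhere.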

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction
kmike commented, Jun 7, 2017

It shouldn’t return canonicalized URLs even if unique=True; it should, however, use canonicalized URLs for deduplication when unique is True.
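In other words, deduplication should key on the canonical form while the original URLs are returned. A minimal sketch of that behavior, where the canonicalize helper is a simplified stdlib stand-in for w3lib.url.canonicalize_url rather than Scrapy's actual code:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def canonicalize(url):
    # Simplified stand-in for w3lib.url.canonicalize_url:
    # sort query parameters so equivalent URLs compare equal.
    parts = urlparse(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunparse(parts._replace(query=query))

def deduplicate(links, unique=True, canonicalize_links=False):
    # Canonicalize the *returned* links only when explicitly asked to;
    # when unique=True, the canonical form is used only as the dedup key.
    if canonicalize_links:
        links = [canonicalize(u) for u in links]
    if not unique:
        return links
    seen, out = set(), []
    for u in links:
        key = canonicalize(u)
        if key not in seen:
            seen.add(key)
            out.append(u)
    return out

links = ["http://example.com/?b=2&a=1", "http://example.com/?a=1&b=2"]
print(deduplicate(links))  # ['http://example.com/?b=2&a=1'] (original form kept)
```

The two input URLs differ only in query-argument order, so they collapse to one entry, but the survivor keeps its original, non-canonicalized form.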

0 reactions
huangxuuun commented, Nov 23, 2017

Something goes wrong if you set canonicalize = False. To debug it, I changed the code in the standard library's utf_8.py like this:

def decode(input, errors='strict'):
    # Python 2 debugging hack: print the raw bytes and their GBK decoding
    # before handing them to the UTF-8 codec
    print input, input.decode('gbk')
    return codecs.utf_8_decode(input, errors, True)

and this is the output:

http://dota2.uuu9.com/url/dota2������Ѷ.url http://dota2.uuu9.com/url/dota2最新资讯.url
2017-11-23 16:23:13 [scrapy.core.scraper] ERROR: Spider error processing <GET http://dota2.uuu9.com/List/List_7598.shtml> (referer: http://dota2.uuu9.com)
Traceback (most recent call last):
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\spiders\crawl.py", line 82, in _parse_response
    for request_or_item in self._requests_to_follow(response):
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\spiders\crawl.py", line 61, in _requests_to_follow
    links = [lnk for lnk in rule.link_extractor.extract_links(response)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\linkextractors\lxmlhtml.py", line 128, in extract_links
    links = self._extract_links(doc, response.url, response.encoding, base_url)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\linkextractors\__init__.py", line 109, in _extract_links
    return self.link_extractor._extract_links(*args, **kwargs)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\linkextractors\lxmlhtml.py", line 76, in _extract_links
    return self._deduplicate_if_needed(links)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\linkextractors\lxmlhtml.py", line 91, in _deduplicate_if_needed
    return unique_list(links, key=self.link_key)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\utils\python.py", line 76, in unique
    seenkey = key(item)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\linkextractors\lxmlhtml.py", line 43, in <lambda>
    keep_fragments=True)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\w3lib\url.py", line 433, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\w3lib\url.py", line 510, in parse_url
    return urlparse(to_unicode(url, encoding))
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\w3lib\util.py", line 27, in to_unicode
    return text.decode(encoding, errors)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\encodings\utf_8.py", line 17, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd7 in position 31: invalid continuation byte
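The failure is reproducible without Scrapy: the href bytes are GBK-encoded, but the canonicalization step decodes them as UTF-8. The sample text below ("dota2最新资讯") is an assumption matching the mojibake in the log:

```python
# GBK-encoded bytes for a non-ASCII link; 最 encodes to a byte pair
# starting with 0xD7 in GBK, matching the byte reported in the traceback.
raw = "dota2最新资讯".encode("gbk")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xd7 ... invalid continuation byte

print(raw.decode("gbk"))  # decoding with the right codec recovers the text
```

0xD7 looks like a two-byte UTF-8 lead byte, but the byte that follows it in GBK is not a valid UTF-8 continuation byte, hence the "invalid continuation byte" message.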

Top Results From Across the Web

Include Image src to LinkExtractor Scrapy CrawlSpider
I'm working on crawling a site, and I'm using LinkExtractor from scrapy to crawl links and determine their response status.

How to build Scrapy LinkExtractor with Parameters? - eduCBA
canonicalize_url is used to convert the retrieved URL to a standard format. unique: this parameter controls whether repeated links are filtered out....

Link Extractors — Scrapy 2.7.1 documentation
A link extractor is an object that extracts links from responses. ... (bool) – canonicalize each extracted url (using w3lib.url.canonicalize_url).

Beginner's Guide to Finding and Fixing SEO Canonical Issues
Canonical issues caused by duplicate content are a really common SEO problem for websites. Having identical or very similar content on more ...

canonicalize at LinkExtractor works incorrectly - Reading Notes - Mastering ...
crydby's reading notes: regarding canonicalize = True; continuing from the previous note, crawling 迅读网 failed with UnicodeError: utf-8 codec can't decode 1.
