
canonicalize at LinkExtractor works incorrectly

See original GitHub issue

First, see https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/lxmlhtml.py#L110, and then https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/lxmlhtml.py#L40.

So when canonicalize = False (the current default value), it still runs canonicalize_url, but it shouldn't, given the documented behavior:

:param canonicalize: canonicalize each extracted url (using w3lib.url.canonicalize_url)
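For context, canonicalization rewrites a URL into a standard form. Below is a rough stdlib approximation of one thing w3lib.url.canonicalize_url does (percent-encoding a non-ASCII path); the real function also sorts query arguments, normalizes case, strips fragments, and more, and the domain below is made up:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def rough_canonicalize(url, encoding="utf-8"):
    # Simplified stand-in for w3lib.url.canonicalize_url: percent-encode
    # the path using the given encoding; safe="/%" leaves path separators
    # and already-escaped sequences alone.
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%", encoding=encoding)
    return urlunsplit(parts._replace(path=path))

print(rough_canonicalize("http://dota2.example.com/url/资讯.url"))
# http://dota2.example.com/url/%E8%B5%84%E8%AE%AF.url
```

Note that this step has to decode the raw URL bytes with some encoding before it can re-encode them, which is why a wrong encoding assumption surfaces here rather than elsewhere.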

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction
kmike commented, Jun 7, 2017

It shouldn’t return canonicalized URLs even if unique=True; it should, however, use canonicalized URLs for deduplication when unique is True.
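In other words, deduplication should key on the canonical form while the original URLs are returned. A minimal sketch of that behavior, where the canonicalize helper is a simplified stdlib stand-in for w3lib.url.canonicalize_url rather than Scrapy's actual code:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def canonicalize(url):
    # Simplified stand-in for w3lib.url.canonicalize_url:
    # sort query parameters so equivalent URLs compare equal.
    parts = urlparse(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunparse(parts._replace(query=query))

def deduplicate(links, unique=True, canonicalize_links=False):
    # Canonicalize the *returned* links only when explicitly asked to;
    # when unique=True, the canonical form is used only as the dedup key.
    if canonicalize_links:
        links = [canonicalize(u) for u in links]
    if not unique:
        return links
    seen, out = set(), []
    for u in links:
        key = canonicalize(u)
        if key not in seen:
            seen.add(key)
            out.append(u)
    return out

links = ["http://example.com/?b=2&a=1", "http://example.com/?a=1&b=2"]
print(deduplicate(links))  # ['http://example.com/?b=2&a=1'] (original form kept)
```

The two input URLs differ only in query-argument order, so they collapse to one entry, but the survivor keeps its original, non-canonicalized form.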

0 reactions
huangxuuun commented, Nov 23, 2017

Something goes wrong if you set canonicalize = False. To debug it, I changed the code in the standard library's utf_8.py like this:

def decode(input, errors='strict'):
    # Python 2 debugging hack: print the raw bytes and their GBK decoding
    # before handing them to the UTF-8 codec
    print input, input.decode('gbk')
    return codecs.utf_8_decode(input, errors, True)

and this is the output:

http://dota2.uuu9.com/url/dota2������Ѷ.url http://dota2.uuu9.com/url/dota2最新资讯.url
2017-11-23 16:23:13 [scrapy.core.scraper] ERROR: Spider error processing <GET http://dota2.uuu9.com/List/List_7598.shtml> (referer: http://dota2.uuu9.com)
Traceback (most recent call last):
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\spiders\crawl.py", line 82, in _parse_response
    for request_or_item in self._requests_to_follow(response):
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\spiders\crawl.py", line 61, in _requests_to_follow
    links = [lnk for lnk in rule.link_extractor.extract_links(response)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\linkextractors\lxmlhtml.py", line 128, in extract_links
    links = self._extract_links(doc, response.url, response.encoding, base_url)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\linkextractors\__init__.py", line 109, in _extract_links
    return self.link_extractor._extract_links(*args, **kwargs)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\linkextractors\lxmlhtml.py", line 76, in _extract_links
    return self._deduplicate_if_needed(links)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\linkextractors\lxmlhtml.py", line 91, in _deduplicate_if_needed
    return unique_list(links, key=self.link_key)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\utils\python.py", line 76, in unique
    seenkey = key(item)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\scrapy\linkextractors\lxmlhtml.py", line 43, in <lambda>
    keep_fragments=True)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\w3lib\url.py", line 433, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\w3lib\url.py", line 510, in parse_url
    return urlparse(to_unicode(url, encoding))
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\site-packages\w3lib\util.py", line 27, in to_unicode
    return text.decode(encoding, errors)
  File "D:\Users\Administrator\AppData\Local\Programs\Python\Phthon27\lib\encodings\utf_8.py", line 17, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd7 in position 31: invalid continuation byte
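The failure is reproducible without Scrapy: the href bytes are GBK-encoded, but the canonicalization step decodes them as UTF-8. The sample text below ("dota2最新资讯") is an assumption matching the mojibake in the log:

```python
# GBK-encoded bytes for a non-ASCII link; 最 encodes to a byte pair
# starting with 0xD7 in GBK, matching the byte reported in the traceback.
raw = "dota2最新资讯".encode("gbk")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xd7 ... invalid continuation byte

print(raw.decode("gbk"))  # decoding with the right codec recovers the text
```

0xD7 looks like a two-byte UTF-8 lead byte, but the byte that follows it in GBK is not a valid UTF-8 continuation byte, hence the "invalid continuation byte" message.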

Top Results From Across the Web

Include Image src to LinkExtractor Scrapy CrawlSpider
I'm working on crawling a site, and I'm using LinkExtractor from scrapy to crawl links and determine their response status.

How to build Scrapy LinkExtractor with Parameters? - eduCBA
canonicalize_url is used to convert the retrieved URL to a standard format. unique: this parameter controls whether repeated links are filtered out....

Link Extractors — Scrapy 2.7.1 documentation
A link extractor is an object that extracts links from responses. ... (bool) – canonicalize each extracted url (using w3lib.url.canonicalize_url).

Beginner's Guide to Finding and Fixing SEO Canonical Issues
Canonical issues caused by duplicate content are a really common SEO problem for websites. Having identical or very similar content on more ...

canonicalize at LinkExtractor works incorrectly - Reading Notes - Mastering ...
crydby's reading notes: regarding canonicalize = True; continuing from the previous note, crawling 迅读网 failed with UnicodeError: utf-8 codec can't decode 1.
