canonicalize at LinkExtractor works incorrectly
First: https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/lxmlhtml.py#L110
And then: https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/lxmlhtml.py#L40
So, when canonicalize = False (current default value), it will run canonicalize_url, but it shouldn’t:
:param canonicalize: canonicalize each extracted url (using w3lib.url.canonicalize_url)
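The behavior the docstring promises can be sketched roughly as follows. This is only an illustration, not Scrapy's code: simple_canonicalize is a simplified stdlib stand-in for w3lib.url.canonicalize_url (it only sorts query arguments and drops the fragment), and process_link is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def simple_canonicalize(url: str) -> str:
    """Simplified stand-in for w3lib.url.canonicalize_url (assumption):
    sorts query arguments and drops the fragment, nothing more."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/", query, ""))


def process_link(url: str, canonicalize: bool = False) -> str:
    """Hypothetical helper: rewrite the URL only when the flag is set."""
    return simple_canonicalize(url) if canonicalize else url


raw = "http://example.com/page?b=2&a=1#top"
print(process_link(raw))                     # URL returned untouched
print(process_link(raw, canonicalize=True))  # http://example.com/page?a=1&b=2
```

With canonicalize = False the extracted URL should come back exactly as it appeared in the page, which is what the issue says is not happening.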
Issue Analytics
- State:
- Created 6 years ago
- Comments:8 (7 by maintainers)
Top GitHub Comments
It shouldn’t return canonicalized URLs even if unique=True though, but it should use canonicalized URLs for deduplication when unique is True.
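One way to read this comment is that canonicalization serves two separate purposes: building the deduplication key and, optionally, rewriting the returned URL. A minimal sketch under that reading is below. The names dedup_key and extract_unique are hypothetical, and dedup_key is a simplified stdlib stand-in for w3lib.url.canonicalize_url, not the real function.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def dedup_key(url: str) -> str:
    """Simplified stand-in for w3lib.url.canonicalize_url (assumption):
    sorts query arguments and drops the fragment."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/", query, ""))


def extract_unique(urls, unique=True, canonicalize=False):
    """Hypothetical helper: deduplicate on the canonical form, but return
    canonicalized URLs only when canonicalize is True."""
    seen = set()
    result = []
    for url in urls:
        key = dedup_key(url)
        if unique:
            if key in seen:
                continue  # same page under a different spelling of the URL
            seen.add(key)
        result.append(key if canonicalize else url)
    return result


urls = [
    "http://example.com/?b=2&a=1",
    "http://example.com/?a=1&b=2",  # duplicate of the first after canonicalization
    "http://example.com/other",
]
print(extract_unique(urls))
print(extract_unique(urls, canonicalize=True))
```

In this sketch the first call keeps the original spelling of each surviving URL, while the second returns the canonical forms; both drop the duplicate.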
Something goes wrong if you set canonicalize = False. I changed the code in utf_8.py like this, and this is the output: