canonicalize_url in LinkExtractor: not what browsers do
By default, the link extractor calls canonicalize_url on the collected links. The following is not what browsers do:
>>> canonicalize_url('http://example.com/index.php?/a/=/o/')
'http://example.com/index.php?%2Fa%2F=%2Fo%2F'  # forward slashes percent-encoded
>>> canonicalize_url('http://example.com/index.php?a')
'http://example.com/index.php?a='  # '=' appended to empty arguments
I doubt this is a problem in canonicalize_url because it’s not meant to mimic browsers in the first place, is it?
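Both transformations can be reproduced with the standard library alone. The sketch below assumes canonicalize_url's query handling boils down to a parse_qsl/urlencode round trip (as w3lib's does); `requote_query` is a hypothetical helper, not part of any library:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def requote_query(url):
    # Split the URL, round-trip the query string through parse_qsl/urlencode,
    # and reassemble -- roughly what canonicalize_url does to the query part.
    scheme, netloc, path, query, fragment = urlsplit(url)
    pairs = parse_qsl(query, keep_blank_values=True)
    return urlunsplit((scheme, netloc, path, urlencode(pairs), fragment))

print(requote_query('http://example.com/index.php?/a/=/o/'))
# http://example.com/index.php?%2Fa%2F=%2Fo%2F  (slashes percent-encoded)
print(requote_query('http://example.com/index.php?a'))
# http://example.com/index.php?a=  ('=' appended to the bare argument)
```

The round trip loses information because urlencode's default quoting treats `/` as unsafe and because parse_qsl pairs every key with a value, even an empty one.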
However, this is a problem for the link extractor, because it can end up extracting URLs that are wrong from the server's perspective. In this example, the server doesn't recognise the extractor's URL, only the browser's:
# http://forum.laptop.bg/index.php?/discover/
LinkExtractor(restrict_xpaths=('//a[contains(@href, "/topic")]',)).extract_links(response)[0].url
# Extractor: http://forum.laptop.bg/index.php?%2Ftopic%2F57339-%D0%BB%D0%B0%D0%BF%D1%82%D0%BE%D0%BF-asus-w90vp%2F=&comment=221153&do=findComment
# Browser: http://forum.laptop.bg/index.php?/topic/57339-%D0%BB%D0%B0%D0%BF%D1%82%D0%BE%D0%BF-asus-w90vp/&do=findComment&comment=221153
Was this a design decision or a bug?
Issue Analytics
- Created: 7 years ago
- Comments: 7 (7 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
FWIW, regarding percent-escaping of /: this is what Chrome (Version 50.0.2661.75 (64-bit)) and Firefox (45.0.2) request for such a link (as received by the HTTP server). The encoding actually happens in Python's urlencode(): in Python 2, urlencode() has no safe argument; in Python 3, urlencode() has safe, so it's easier to get closer to browsers.

No, I think the issue is valid.