
canonicalize_url in linkextractor: not what browsers do

See original GitHub issue

By default, the link extractor calls canonicalize_url on the collected links. The following is not what browsers do:

>>> canonicalize_url('http://example.com/index.php?/a/=/o/')
'http://example.com/index.php?%2Fa%2F=%2Fo%2F'  # percent-encodes forward slashes
>>> canonicalize_url('http://example.com/index.php?a')
'http://example.com/index.php?a='  # appends = to empty arguments
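For readers who want to reproduce the two behaviours without Scrapy, here is a stdlib-only sketch (the `naive_canonicalize` name is mine, not the real implementation) that round-trips the query string through `parse_qsl`/`urlencode`, which is where both effects come from:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def naive_canonicalize(url):
    """Round-trip the query string the way canonicalize_url does.

    urlencode() percent-escapes '/' in values, and parse_qsl() with
    keep_blank_values=True turns a bare key 'a' into ('a', ''),
    which urlencode() renders as 'a='.
    """
    scheme, netloc, path, query, fragment = urlsplit(url)
    pairs = parse_qsl(query, keep_blank_values=True)
    return urlunsplit((scheme, netloc, path, urlencode(pairs), fragment))

print(naive_canonicalize('http://example.com/index.php?/a/=/o/'))
# http://example.com/index.php?%2Fa%2F=%2Fo%2F
print(naive_canonicalize('http://example.com/index.php?a'))
# http://example.com/index.php?a=
```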

I doubt this is a problem in canonicalize_url because it’s not meant to mimic browsers in the first place, is it?

However, this is a problem for the link extractor, because it can end up extracting URLs that are wrong from the server’s perspective. In the example below, the server only recognises the browser’s URL, not the extractor’s:

# http://forum.laptop.bg/index.php?/discover/
LinkExtractor(restrict_xpaths=('//a[contains(@href, "/topic")]',)).extract_links(response)[0].url
# Extractor: http://forum.laptop.bg/index.php?%2Ftopic%2F57339-%D0%BB%D0%B0%D0%BF%D1%82%D0%BE%D0%BF-asus-w90vp%2F=&comment=221153&do=findComment
# Browser:   http://forum.laptop.bg/index.php?/topic/57339-%D0%BB%D0%B0%D0%BF%D1%82%D0%BE%D0%BF-asus-w90vp/&do=findComment&comment=221153

Was this a design decision or a bug?

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
redapple commented on Apr 20, 2016

FWIW, regarding percent-escaping of /: for a link such as

<a href="http://localhost:8001/query?a=/&b=?&c=@&d=:">/query?a=/&b=?&c=@&d=:</a>

this is what Chrome (50.0.2661.75, 64-bit) and Firefox (45.0.2) request, as received by the HTTP server:

------------------------------------------------------------------------------------------------------------------------
Chrome
127.0.0.1 - - [20/Apr/2016 18:05:57] "GET /query?a=/&b=?&c=@&d=: HTTP/1.1" 200 -
------------------------------------------------------------------------------------------------------------------------
Firefox
127.0.0.1 - - [20/Apr/2016 18:07:15] "GET /query?a=/&b=?&c=@&d=: HTTP/1.1" 200 -
------------------------------------------------------------------------------------------------------------------------
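The experiment above can be repeated with a few lines of stdlib Python; the hypothetical handler below (not part of the original setup) echoes the raw request target back, so any client’s query-string encoding is visible:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoPath(BaseHTTPRequestHandler):
    def do_GET(self):
        # self.path is the request target exactly as it arrived on the wire,
        # so you can compare what a browser vs. Scrapy actually sends.
        body = self.path.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet; the response body carries the path

# To run the experiment, then click the test link in a browser:
# HTTPServer(('127.0.0.1', 8001), EchoPath).serve_forever()
```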

The encoding actually happens in Python’s urlencode():

Python 2, where urlencode() has no safe argument:

$ python2
Python 2.7.10 (default, Oct 14 2015, 16:09:02) 
[GCC 5.2.1 20151010] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> urllib.urlencode([('a', '/'), ('b', '?'), ('c', '@'), ('d', ':')])
'a=%2F&b=%3F&c=%40&d=%3A'

Python 3, where urlencode() has a safe argument, so it’s easier to get closer to browser behaviour:

$ python3
Python 3.4.3+ (default, Oct 14 2015, 16:03:50) 
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.parse
>>> urllib.parse.urlencode([('a', '/'), ('b', '?'), ('c', '@'), ('d', ':')])
'a=%2F&b=%3F&c=%40&d=%3A'
>>> urllib.parse.urlencode([('a', '/'), ('b', '?'), ('c', '@'), ('d', ':')], safe='/?:@')
'a=/&b=?&c=@&d=:'
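For completeness, a browser-like encoding of the same pairs can also be built directly on quote(); the `browser_like_qs` helper below is a hypothetical sketch, not an API from Scrapy or w3lib:

```python
from urllib.parse import quote

def browser_like_qs(pairs):
    """Encode query pairs while leaving '/', '?', ':' and '@' intact,
    matching what the browsers above put on the wire."""
    return '&'.join('%s=%s' % (k, quote(v, safe='/?:@')) for k, v in pairs)

print(browser_like_qs([('a', '/'), ('b', '?'), ('c', '@'), ('d', ':')]))
# a=/&b=?&c=@&d=:
```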
0 reactions
kmike commented on Apr 21, 2016

No, I think the issue is valid.

