
canonicalize_url in linkextractor: not what browsers do

See original GitHub issue

By default, the link extractor calls canonicalize_url on the collected links. The following is not what browsers do:

>>> canonicalize_url('http://example.com/index.php?/a/=/o/')
'http://example.com/index.php?%2Fa%2F=%2Fo%2F'  # percent-encodes forward slashes
>>> canonicalize_url('http://example.com/index.php?a')
'http://example.com/index.php?a='  # appends = to empty arguments
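For readers who want to reproduce the two behaviours without Scrapy, here is a stdlib-only sketch (the `naive_canonicalize` name is mine, not the real implementation) that round-trips the query string through `parse_qsl`/`urlencode`, which is where both effects come from:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def naive_canonicalize(url):
    """Round-trip the query string the way canonicalize_url does.

    urlencode() percent-escapes '/' in values, and parse_qsl() with
    keep_blank_values=True turns a bare key 'a' into ('a', ''),
    which urlencode() renders as 'a='.
    """
    scheme, netloc, path, query, fragment = urlsplit(url)
    pairs = parse_qsl(query, keep_blank_values=True)
    return urlunsplit((scheme, netloc, path, urlencode(pairs), fragment))

print(naive_canonicalize('http://example.com/index.php?/a/=/o/'))
# http://example.com/index.php?%2Fa%2F=%2Fo%2F
print(naive_canonicalize('http://example.com/index.php?a'))
# http://example.com/index.php?a=
```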

I doubt this is a problem in canonicalize_url because it’s not meant to mimic browsers in the first place, is it?

However, this is a problem for the link extractor, because it can end up extracting URLs that are wrong from the server’s perspective. In the example below, the server only recognises the browser’s URL, not the extractor’s:

# http://forum.laptop.bg/index.php?/discover/
LinkExtractor(restrict_xpaths=('//a[contains(@href, "/topic")]',)).extract_links(response)[0].url
# Extractor: http://forum.laptop.bg/index.php?%2Ftopic%2F57339-%D0%BB%D0%B0%D0%BF%D1%82%D0%BE%D0%BF-asus-w90vp%2F=&comment=221153&do=findComment
# Browser:   http://forum.laptop.bg/index.php?/topic/57339-%D0%BB%D0%B0%D0%BF%D1%82%D0%BE%D0%BF-asus-w90vp/&do=findComment&comment=221153

Was this a design decision or a bug?

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
redapple commented on Apr 20, 2016

FWIW, regarding percent-escaping of /: for a link such as

<a href="http://localhost:8001/query?a=/&b=?&c=@&d=:">/query?a=/&b=?&c=@&d=:</a>

this is what Chrome (50.0.2661.75, 64-bit) and Firefox (45.0.2) request, as received by the HTTP server:

------------------------------------------------------------------------------------------------------------------------
Chrome
127.0.0.1 - - [20/Apr/2016 18:05:57] "GET /query?a=/&b=?&c=@&d=: HTTP/1.1" 200 -
------------------------------------------------------------------------------------------------------------------------
Firefox
127.0.0.1 - - [20/Apr/2016 18:07:15] "GET /query?a=/&b=?&c=@&d=: HTTP/1.1" 200 -
------------------------------------------------------------------------------------------------------------------------
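The experiment above can be repeated with a few lines of stdlib Python; the hypothetical handler below (not part of the original setup) echoes the raw request target back, so any client’s query-string encoding is visible:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoPath(BaseHTTPRequestHandler):
    def do_GET(self):
        # self.path is the request target exactly as it arrived on the wire,
        # so you can compare what a browser vs. Scrapy actually sends.
        body = self.path.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet; the response body carries the path

# To run the experiment, then click the test link in a browser:
# HTTPServer(('127.0.0.1', 8001), EchoPath).serve_forever()
```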

The encoding actually happens in Python’s urlencode():

Python 2, where urlencode() has no safe argument:

$ python2
Python 2.7.10 (default, Oct 14 2015, 16:09:02) 
[GCC 5.2.1 20151010] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> urllib.urlencode([('a', '/'), ('b', '?'), ('c', '@'), ('d', ':')])
'a=%2F&b=%3F&c=%40&d=%3A'

Python 3, where urlencode() has a safe argument, so it’s easier to get closer to browser behaviour:

$ python3
Python 3.4.3+ (default, Oct 14 2015, 16:03:50) 
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.parse
>>> urllib.parse.urlencode([('a', '/'), ('b', '?'), ('c', '@'), ('d', ':')])
'a=%2F&b=%3F&c=%40&d=%3A'
>>> urllib.parse.urlencode([('a', '/'), ('b', '?'), ('c', '@'), ('d', ':')], safe='/?:@')
'a=/&b=?&c=@&d=:'
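For completeness, a browser-like encoding of the same pairs can also be built directly on quote(); the `browser_like_qs` helper below is a hypothetical sketch, not an API from Scrapy or w3lib:

```python
from urllib.parse import quote

def browser_like_qs(pairs):
    """Encode query pairs while leaving '/', '?', ':' and '@' intact,
    matching what the browsers above put on the wire."""
    return '&'.join('%s=%s' % (k, quote(v, safe='/?:@')) for k, v in pairs)

print(browser_like_qs([('a', '/'), ('b', '?'), ('c', '@'), ('d', ':')]))
# a=/&b=?&c=@&d=:
```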
0 reactions
kmike commented on Apr 21, 2016

No, I think the issue is valid.

