Scrapy should handle "invalid" relative URLs better
See original GitHub issue.

Currently Scrapy can't extract links from the http://scrapy.org/ page correctly, because the URLs in the page header are relative to a non-existent parent: ../download/, ../doc/, etc. Browsers resolve these links as http://scrapy.org/download/ and http://scrapy.org/doc/, while response.urljoin, urlparse.urljoin and our link extractors resolve them as http://scrapy.org/../download/, etc. This results in 400 Bad Request responses.
urlparse.urljoin is not correct (or not modern) here. The URL Living Standard that browsers follow says:

If buffer is "..", remove url's path's last entry, if any, and then if c is neither "/" nor "\", append the empty string to url's path.
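The quoted step amounts to dot-segment removal that never climbs above the root, which is how browsers turn /../download/ into /download/. A minimal sketch of that rule in plain Python (this illustrates the spec behavior; it is not Scrapy's or urlparse's actual code, and the function name is mine):

```python
def remove_dot_segments(path):
    """Resolve "." and ".." in an absolute URL path the way browsers do
    (URL Standard / RFC 3986 section 5.2.4): ".." removes the last path
    entry, if any, but never climbs above the root."""
    segments = path.split("/")
    output = [""]  # the leading empty segment anchors the root
    for seg in segments[1:]:
        if seg == ".":
            continue
        if seg == "..":
            if len(output) > 1:  # nothing to remove above the root
                output.pop()
            continue
        output.append(seg)
    # a trailing "." or ".." still leaves a trailing slash
    if segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output)

print(remove_dot_segments("/../download/"))  # -> /download/
```

With this rule, http://scrapy.org/../download/ resolves to http://scrapy.org/download/, matching what browsers do.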
Issue Analytics
- State:
- Created 8 years ago
- Comments: 15 (10 by maintainers)
Top Results From Across the Web

- python - Avoid bad requests due to relative urls - Stack Overflow
  "Basically deep down, scrapy uses http://docs.python.org/2/library/urlparse.html#urlparse.urljoin for getting the next url by joining currenturl and url link ..."
- Requests and Responses — Scrapy 2.7.1 documentation
  "If the URL is invalid, a ValueError exception is raised. ... Requests with a higher priority value will execute earlier."
- Command line tool — Scrapy 2.7.1 documentation
  "You use the scrapy tool from inside your projects to control and manage them. ... Some Scrapy commands (like crawl) must be ..."
- Settings — Scrapy 2.7.1 documentation
  "The value of SCRAPY_SETTINGS_MODULE should be in Python path syntax, ... You can explicitly override one (or more) settings using the -s (or ..."
- Release notes — Scrapy 1.8.3 documentation
  "You will need to upgrade scrapy-splash to a greater version for it to continue to work. ... handle (non-standard) relative sitemap URLs (issue ..."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
The most evil spider ever: looks innocent, but doesn’t work for multiple reasons
I am also having issues due to urljoin behaving differently to modern browsers. For example, all modern browsers tolerate extra slashes after http: (see e.g. this discussion in curl). Using urljoin on the most recent Python 3.8 I get:

whereas browsers would load http://other.com. It would be great if it were possible to replicate the URL handling of browsers.

UPDATE: I've just tried scurl and, as you would expect, it handles this case correctly (i.e. the same as Chrome). It would be great if Scrapy could adopt it, or any other library that wraps an actual browser implementation.
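The urljoin snippet referenced in the comment above did not survive extraction. A plausible reconstruction of the extra-slash case it describes (the http://example.com/ base URL is an assumption; the other.com target comes from the comment):

```python
from urllib.parse import urljoin

# Browsers collapse the extra slash and parse "other.com" as the host,
# so they load http://other.com/. urljoin instead sees an empty host,
# keeps the base host, and treats /other.com as a path.
print(urljoin("http://example.com/", "http:///other.com"))
# -> http://example.com/other.com
```

The divergence is exactly the one the comment complains about: same input, two different targets depending on whether the resolver follows the URL Living Standard or urllib's RFC-based parsing.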