
Scrapy should handle "invalid" relative URLs better


Currently Scrapy can’t extract links from the http://scrapy.org/ page correctly, because URLs in the page header are relative to a non-existent parent: ../download/, ../doc/, etc. Browsers resolve these links as http://scrapy.org/download/ and http://scrapy.org/doc/, while response.urljoin, urlparse.urljoin and our link extractors resolve them as http://scrapy.org/../download/, etc. This results in 400 Bad Request responses.

urlparse.urljoin is not correct (or at least not modern) here. The URL Living Standard for browsers says:

If buffer is "..", remove url’s path’s last entry, if any, and then if c is neither "/" nor "\", append the empty string to url’s path.
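As a quick check of that resolution rule (a sketch; note that the "../"-survives behaviour described above comes from the Python 2 urlparse.urljoin of the era, while Python 3.5+ applies the equivalent RFC 3986 dot-segment removal and already agrees with browsers for this case):

```python
from urllib.parse import urljoin

# The scrapy.org header links are relative to a non-existent parent
# directory. A browser resolving '../download/' against
# 'http://scrapy.org/' drops the impossible '..' step, as the URL
# Living Standard quoted above requires. Python 3.5+ urljoin performs
# the matching RFC 3986 dot-segment removal:
print(urljoin('http://scrapy.org/', '../download/'))
# http://scrapy.org/download/
```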

Issue Analytics

  • State: open
  • Created: 8 years ago
  • Comments: 15 (10 by maintainers)

Top GitHub Comments

1 reaction
kmike commented, Apr 20, 2016

The most evil spider ever: looks innocent, but doesn’t work for multiple reasons

import scrapy

class ScrapySpider(scrapy.Spider):
    name = 'scrapyspider'

    def start_requests(self):
        yield scrapy.Request("http://scrapy.org", self.parse_main)

    def parse_main(self, response):
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), self.parse_link)

    def parse_link(self, response):
        print(response.url)

0 reactions
nirvana-msu commented, Feb 16, 2020

I am also having issues due to urljoin behaving differently from modern browsers.

For example, all modern browsers tolerate extra slashes after http (see e.g. this discussion in curl).

Using urljoin on the most recent Python 3.8, I get:

>>> urljoin('http://base.com', 'http:////other.com')
'http://base.com//other.com'

whereas browsers would load http://other.com. Would be great if it was possible to replicate the URL-handling of browsers.

UPDATE:

I’ve just tried scurl and, as you would expect, it handles this case correctly (i.e. same as Chrome). Would be great if Scrapy could adopt it, or any other library that wraps actual browser implementation.
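One possible stopgap for the extra-slashes case (a sketch with a hypothetical helper name, not part of Scrapy or scurl): pre-normalize the run of slashes after the scheme before handing the URL to the standard library, mimicking the leniency of browsers and curl for this particular malformation:

```python
import re
from urllib.parse import urljoin

def browser_like_urljoin(base, url):
    # Hypothetical workaround: collapse the run of slashes after an
    # http(s) scheme to exactly two, the way browsers and curl
    # tolerate 'http:////host', then delegate the actual join to the
    # standard library.
    url = re.sub(r'^(https?:)/{2,}', r'\1//', url)
    return urljoin(base, url)

# With the slashes collapsed, urljoin sees a proper absolute URL:
print(browser_like_urljoin('http://base.com', 'http:////other.com'))
# http://other.com

# Ordinary relative URLs are passed through untouched:
print(browser_like_urljoin('http://base.com', '/path'))
# http://base.com/path
```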

