
Scrapy should handle "invalid" relative URLs better


Currently Scrapy can’t extract links from the http://scrapy.org/ page correctly, because URLs in the page header are relative to a non-existent parent: ../download/, ../doc/, etc. Browsers resolve these links as http://scrapy.org/download/ and http://scrapy.org/doc/, while response.urljoin, urlparse.urljoin and our link extractors resolve them as http://scrapy.org/../download/, etc. This results in 400 Bad Request responses.

urlparse.urljoin is not correct (or at least not modern) here. The URL Living Standard for browsers says:

If buffer is "..", remove url’s path’s last entry, if any, and then if c is neither "/" nor "\", append the empty string to url’s path.
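As a quick check of that resolution rule (a sketch; note that the "../"-survives behaviour described above comes from the Python 2 urlparse.urljoin of the era, while Python 3.5+ applies the equivalent RFC 3986 dot-segment removal and already agrees with browsers for this case):

```python
from urllib.parse import urljoin

# The scrapy.org header links are relative to a non-existent parent
# directory. A browser resolving '../download/' against
# 'http://scrapy.org/' drops the impossible '..' step, as the URL
# Living Standard quoted above requires. Python 3.5+ urljoin performs
# the matching RFC 3986 dot-segment removal:
print(urljoin('http://scrapy.org/', '../download/'))
# http://scrapy.org/download/
```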

Issue Analytics

  • State: open
  • Created: 8 years ago
  • Comments: 15 (10 by maintainers)

Top GitHub Comments

1 reaction
kmike commented, Apr 20, 2016

The most evil spider ever: looks innocent, but doesn’t work for multiple reasons

import scrapy

class ScrapySpider(scrapy.Spider):
    name = 'scrapyspider'

    def start_requests(self):
        yield scrapy.Request("http://scrapy.org", self.parse_main)

    def parse_main(self, response):
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), self.parse_link)

    def parse_link(self, response):
        print(response.url)

0 reactions
nirvana-msu commented, Feb 16, 2020

I am also having issues due to urljoin behaving differently from modern browsers.

For example, all modern browsers tolerate extra slashes after http (see e.g. this discussion in curl).

Using urljoin on the most recent Python 3.8, I get:

>>> urljoin('http://base.com', 'http:////other.com')
'http://base.com//other.com'

whereas browsers would load http://other.com. Would be great if it was possible to replicate the URL-handling of browsers.

UPDATE:

I’ve just tried scurl and, as you would expect, it handles this case correctly (i.e. same as Chrome). Would be great if Scrapy could adopt it, or any other library that wraps actual browser implementation.
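One possible stopgap for the extra-slashes case (a sketch with a hypothetical helper name, not part of Scrapy or scurl): pre-normalize the run of slashes after the scheme before handing the URL to the standard library, mimicking the leniency of browsers and curl for this particular malformation:

```python
import re
from urllib.parse import urljoin

def browser_like_urljoin(base, url):
    # Hypothetical workaround: collapse the run of slashes after an
    # http(s) scheme to exactly two, the way browsers and curl
    # tolerate 'http:////host', then delegate the actual join to the
    # standard library.
    url = re.sub(r'^(https?:)/{2,}', r'\1//', url)
    return urljoin(base, url)

# With the slashes collapsed, urljoin sees a proper absolute URL:
print(browser_like_urljoin('http://base.com', 'http:////other.com'))
# http://other.com

# Ordinary relative URLs are passed through untouched:
print(browser_like_urljoin('http://base.com', '/path'))
# http://base.com/path
```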

