Better API for creating requests from responses
Sometimes a Request needs information from the response in order to be sent correctly. There are at least two use cases:
- URL encoding depends on response encoding (see https://github.com/scrapy/scrapy/pull/1923);
- relative URLs should be resolved based on response.url or its base url (see https://github.com/scrapy/scrapy/issues/548).
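For the second use case, the standard library's `urljoin` (which is what `response.urljoin` builds on) shows the resolution behaviour a framework method would need to apply for the user:

```python
from urllib.parse import urljoin

# Resolving relative hrefs against the page URL, as response.urljoin does.
base = "http://example.com/products/page2.html"
print(urljoin(base, "item?id=7"))  # resolved against the current directory
print(urljoin(base, "/about"))     # resolved against the site root
```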
I think the current API is not good enough. The most obvious code is not correct:

```python
for url in response.xpath("//a/@href").extract():
    yield scrapy.Request(url, self.parse)
```
To do this correctly, the user has to write the following:

```python
for url in response.xpath("//a/@href").extract():
    yield scrapy.Request(response.urljoin(url), self.parse, encoding=response.encoding)
```
Or this:

```python
for link in LinkExtractor().extract_links(response):
    yield scrapy.Request(link.url, self.parse, encoding=response.encoding)
```
The LinkExtractor solution has its own gotchas, e.g. canonicalize_url is called by default and fragments are removed. This means that e.g. AJAX-crawlable URLs are not handled (no `_escaped_fragment_` is sent even if a website supports it); it also makes it harder to use Scrapy with scrapy-splash, which can handle fragments.
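To illustrate the fragment problem with the standard library alone (a sketch of the effect, not LinkExtractor's actual code path): stripping a fragment discards the `#!` part that AJAX-crawlable URLs rely on:

```python
from urllib.parse import urldefrag

# Splitting off the fragment, which is what canonicalization drops by
# default; for AJAX-crawlable URLs the "#!" part carries real state.
url, fragment = urldefrag("http://example.com/page#!state=1")
print(url)       # http://example.com/page
print(fragment)  # !state=1
```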
This is all too easy to get wrong; I think just documenting these gotchas is not good enough for a framework - it should make the easiest way to write something the correct way. IMHO the API shouldn't require the user to instantiate weird objects or pass the response encoding:

```python
for url in response.xpath("//a/@href").extract():
    something.send_request(url, self.parse)
```
This can be implemented if we provide a method on Response to send new requests.
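As a rough sketch of what such a method could look like (the class layout and attribute names below are illustrative stand-ins, not Scrapy's actual implementation):

```python
from urllib.parse import urljoin

class Request:
    """Minimal stand-in for scrapy.Request (illustrative only)."""
    def __init__(self, url, callback=None, encoding="utf-8"):
        self.url = url
        self.callback = callback
        self.encoding = encoding

class Response:
    """Minimal stand-in for scrapy.http.Response (illustrative only)."""
    def __init__(self, url, encoding="utf-8"):
        self.url = url
        self.encoding = encoding

    def follow(self, url, callback=None):
        # Resolve relative URLs against the response URL and carry over
        # the response encoding, so callers don't have to remember either.
        return Request(urljoin(self.url, url), callback, encoding=self.encoding)
```

With something like this, `yield response.follow(url, self.parse)` would be both the most obvious and the correct code.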
A related use case is `async def` functions or methods (https://github.com/scrapy/scrapy/issues/1144#issuecomment-141843616): it is not possible to yield Requests in `async def` functions, so adding a request should be either a method of `self` or a method of `response` if we want to support `async def` callbacks.
`FormRequest.from_response(response, ...)` can also be written as something like `response.submit(...)`.
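A minimal sketch of what such a hypothetical `response.submit(...)` might compute, using only the standard library (the function name and return shape here are assumptions for illustration, not Scrapy API):

```python
from urllib.parse import urljoin, urlencode

def submit(response_url, form_action, formdata, method="POST"):
    """Illustrative sketch: resolve the form action against the page URL
    and encode the form data, as a submit() helper would have to do."""
    url = urljoin(response_url, form_action)
    body = urlencode(formdata)
    if method == "GET":
        return url + "?" + body, method, None
    return url, method, body
```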
Issue Analytics
- State:
- Created: 7 years ago
- Reactions: 3
- Comments: 16 (16 by maintainers)
Top GitHub Comments
So, wishlist for response.follow:
- support for `a` elements (what about other elements? `img src`?)

As a newcomer, a very easy mistake to make is to write `response.follow(url, self.parse)` instead of `yield response.follow(url, self.parse)`.
With `yield scrapy.Request` it is clear we're creating a Request object, but not necessarily executing an action. `response.follow` reads like an action, but it does nothing on its own. But I guess this applies to all async APIs, so people can get used to it.