Better API for creating requests from responses

Sometimes a Request needs information about the response in order to be sent correctly. There are at least two use cases:

  1. URL encoding depends on response encoding (see https://github.com/scrapy/scrapy/pull/1923);
  2. relative URLs should be resolved based on response.url or its base url (see https://github.com/scrapy/scrapy/issues/548).
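
Both use cases can be reproduced with the standard library alone. The following is a minimal sketch, not Scrapy's own code path; the page URL and hrefs are made up for illustration.

```python
from urllib.parse import quote, urljoin

# Hypothetical page URL, chosen only for illustration.
page_url = "http://example.com/catalog/index.html"

# Use case 2: a relative href is only meaningful relative to the page it came from.
absolute = urljoin(page_url, "../items/list.html")
print(absolute)  # http://example.com/items/list.html

# Use case 1: the same character percent-encodes to different bytes
# depending on which encoding is assumed.
utf8_form = quote("é", encoding="utf-8")
latin1_form = quote("é", encoding="latin-1")
print(utf8_form, latin1_form)  # %C3%A9 %E9
```

A server that expects one form may not recognize the other, which is why the request needs to know the encoding of the page the link came from.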

I think the current API is not good enough. The most obvious code is not correct:

for url in response.xpath("//a/@href").extract():
    yield scrapy.Request(url, self.parse)

To do that correctly, the user has to write the following:

for url in response.xpath("//a/@href").extract():
    yield scrapy.Request(response.urljoin(url), self.parse, encoding=response.encoding)

Or this:

for link in LinkExtractor().extract_links(response):
    yield scrapy.Request(link.url, self.parse, encoding=response.encoding)

The LinkExtractor solution has gotchas, e.g. canonicalize_url is called by default and fragments are removed. This means that, for example, Ajax crawlable URLs are not handled (no _escaped_fragment_ even if a website supports it); it also makes it harder to use Scrapy with scrapy-splash, which can handle fragments.
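
To illustrate what is lost when the fragment is dropped, here is a small sketch that uses the standard library's urldefrag as a stand-in for fragment removal. It is not the function LinkExtractor calls internally, and the URL is hypothetical.

```python
from urllib.parse import urldefrag

# Hypothetical Ajax-style URL whose state lives entirely in the fragment.
url = "http://example.com/app#!/items/42"

# urldefrag splits the URL into the part before '#' and the fragment itself.
stripped, fragment = urldefrag(url)

print(stripped)  # http://example.com/app
print(fragment)  # !/items/42
```

Once the fragment is gone, every such link on the page collapses to the same URL, so the information needed to reach the individual item is no longer available to the request.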

All of this is too easy to get wrong. I think just documenting these gotchas is not good enough for a framework: it should make the easiest way to write something also the correct way. IMHO the API shouldn't require the user to instantiate weird objects or pass the response encoding:

for url in response.xpath("//a/@href").extract():
    something.send_request(url, self.parse)

This can be implemented if we provide a method on Response to send new requests.

A related use case is async def functions or methods (https://github.com/scrapy/scrapy/issues/1144#issuecomment-141843616): it is not possible to yield Requests in async def functions, so adding a request should be either a method of self or a method of response if we want to support async def callbacks.
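
A rough sketch of the method-based alternative follows. The Collector class is a hypothetical stand-in for whatever object (self or response) would accept requests, send_request borrows its name from the placeholder above, and nothing here is an existing Scrapy API.

```python
import asyncio


class Collector:
    """Hypothetical stand-in for a spider or response that accepts requests."""

    def __init__(self):
        self.pending = []

    def send_request(self, url, callback=None):
        # Record the request instead of yielding it from the callback.
        self.pending.append((url, callback))


async def parse(collector, urls):
    # An async callback that hands requests over through a method call.
    for url in urls:
        await asyncio.sleep(0)  # placeholder for real awaited work
        collector.send_request(url)


collector = Collector()
asyncio.run(parse(collector, ["http://example.com/a", "http://example.com/b"]))
print([url for url, _ in collector.pending])
```

The callback never yields anything; the caller reads the queued requests from collector.pending after the coroutine finishes.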

FormRequest.from_response(response, ...) can also be written as something like response.submit(...).

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Reactions: 3
  • Comments: 16 (16 by maintainers)

Top GitHub Comments

2 reactions
kmike commented, Jan 30, 2017

So, wishlist for response.follow:

  • support relative and absolute URLs;
  • set encoding correctly;
  • support Selector objects for <a> elements (what about other elements, e.g. img src?);
  • support Link objects;
  • support all Request options.
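
Part of this wishlist can be sketched in a few lines. The Request, Response and Link classes below are simplified, hypothetical stand-ins rather than Scrapy's own classes, and Selector support is left out.

```python
from dataclasses import dataclass, field
from urllib.parse import urljoin


@dataclass
class Link:
    """Hypothetical stand-in for a link object carrying a URL."""
    url: str


@dataclass
class Request:
    """Simplified stand-in for a request; only a few options are modelled."""
    url: str
    callback: object = None
    encoding: str = "utf-8"
    meta: dict = field(default_factory=dict)


@dataclass
class Response:
    url: str
    encoding: str = "utf-8"

    def follow(self, target, callback=None, **kwargs):
        # Accept either a plain URL string or a Link-like object.
        href = target.url if isinstance(target, Link) else target
        # Default to the response encoding unless the caller overrides it.
        kwargs.setdefault("encoding", self.encoding)
        # urljoin handles both relative and absolute URLs.
        return Request(urljoin(self.url, href), callback, **kwargs)


response = Response("http://example.com/a/b.html", encoding="cp1251")
relative = response.follow("../c.html")
from_link = response.follow(Link("http://other.example/x"), meta={"depth": 1})
print(relative.url, relative.encoding)  # http://example.com/c.html cp1251
print(from_link.url, from_link.meta)
```

Passing the remaining keyword arguments straight through to Request is one way to keep all request options available without listing them in the signature.
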
1 reaction
kmike commented, Jan 30, 2017

not sure what are the disadvantages you refer to 😃

As a newcomer, a very easy mistake to make is to write

for a in response.css("a.my-link"):
    response.follow(a, self.parse)

instead of

for a in response.css("a.my-link"):
    yield response.follow(a, self.parse)

With yield scrapy.Request it is clear we’re creating a Request object, but not necessarily executing an action. response.follow reads like an action, but it does nothing on its own. But I guess it applies to all async APIs, so people can get used to this.
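
The difference is easy to see outside of Scrapy. In this sketch make_request is a hypothetical helper that only builds a description of a request and sends nothing.

```python
def make_request(url):
    # Hypothetical helper: builds a request description, performs no I/O.
    return {"url": url}


def parse_without_yield(urls):
    for url in urls:
        make_request(url)  # the result is discarded


def parse_with_yield(urls):
    for url in urls:
        yield make_request(url)  # the result is handed to the caller


urls = ["http://example.com/a", "http://example.com/b"]
forgotten = parse_without_yield(urls)
produced = list(parse_with_yield(urls))
print(forgotten)      # None
print(len(produced))  # 2
```

No error is raised in the first case, which is what makes the mistake hard to notice.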
