Integrate darkrho/scrapy-inline-requests
I think we haven’t created a ticket for this yet; we discussed integrating https://github.com/darkrho/scrapy-inline-requests a couple of weeks ago with @pablohoffman, @kmike and @nramirezuy.
I’m copying the whole discussion here if anyone else wants to join in:
From Pablo:
I’ve heard many people using (and speaking good things about) inline requests recently. Would you consider it a feature to include in 1.0?
From Mikhail:
I’d like to have something like inline-requests builtin; it even was one of the draft ideas for 2014 GSoC 😃 +1 to add it to Scrapy 1.0 if we have time for that.
Past me wrote the following in one of the emails:
It needs a bit more thought before becoming a part of Scrapy: there are no tests, I’m not a fan of how callbacks are handled, and the downsides of the ‘yield’ approach should be clearly documented, e.g. state inside the callback lives longer, which could lead to increased memory usage; it is also unclear how it works with on-disk queues. There are also some useful features present in other similar libraries (e.g. adisp) but missing from scrapy-inline-requests, for example waiting for several requests to be executed in parallel; the syntax could be `resp1, resp2 = yield Request(url1), Request(url2)`.
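For readers unfamiliar with the library, here is a minimal sketch of how scrapy-inline-requests is used, based on its README; the URL and selector are illustrative, not from this discussion:

```python
from scrapy import Spider, Request
from inline_requests import inline_requests

class ExampleSpider(Spider):
    name = 'example'
    start_urls = ['https://example.com']

    @inline_requests
    def parse(self, response):
        # Yielding a bare Request suspends this generator; the decorator
        # resumes it with the response once the download finishes.
        detail = yield Request(response.urljoin('/detail'))
        # `detail` stays referenced by the generator frame until parse()
        # exits -- the memory concern raised above.
        yield {'title': detail.css('title::text').get()}
```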
From Nicolás:
I like the idea behind inline requests, but not its API. It kinda doesn’t fit the callback approach, since it doesn’t work with a callback and you have to manage several requests within a callback.
I would prefer to see something like `def callback: return chain_requests(request1, request2, request3)`, with the callbacks handled normally.
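The proposed `chain_requests` helper was never implemented; the following is a hypothetical sketch of what it could look like (the name and the idea come from the comment above, the implementation is guesswork):

```python
def chain_requests(*requests):
    """Hypothetical helper: return the first request, rewired so that after
    each request's own callback finishes, the next request is scheduled."""
    def make_callback(index):
        original = requests[index].callback

        def callback(response):
            if original is not None:
                # Pass through whatever the user's callback produced.
                yield from (original(response) or ())
            if index + 1 < len(requests):
                yield requests[index + 1].replace(
                    callback=make_callback(index + 1))

        return callback

    return requests[0].replace(callback=make_callback(0))
```

A spider could then write `return chain_requests(request1, request2, request3)` in a callback, and each request would keep its own, normally handled callback.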
From Mikhail:
Nicolás: I think the point of inline-requests is to allow writing code without callbacks and handle several related requests in a single function 😃 It is a common trick to “linearize” callbacks into a generator.
Callbacks + CPython reference counting (no PyPy) provide a nice approach to resource deallocation: if a variable is not referenced from outside, then it is deallocated as soon as the callback exits, without invoking the garbage collector. With generators, if the user writes `response1 = yield …; response2 = yield …` then these responses are kept alive, possibly for a long time. Even with `response = yield …; response = yield …` the response is kept in memory longer than needed (if I’m not mistaken, until the second request finishes). One can write `del response`, but it would be nice to have some clever solution for that.
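A short sketch of the retention problem and the `del` workaround (assuming the `inline_requests` decorator; URLs are illustrative):

```python
from scrapy import Spider, Request
from inline_requests import inline_requests

class TwoPageSpider(Spider):
    name = 'two_page'
    start_urls = ['https://example.com']

    @inline_requests
    def parse(self, response):
        first = yield Request('https://example.com/page1')
        title = first.css('title::text').get()
        del first  # without this, the response (and its body) stays
                   # referenced by the suspended generator frame while
                   # the next request is downloading
        second = yield Request('https://example.com/page2')
        yield {'first_title': title,
               'second_title': second.css('title::text').get()}
```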
From Julia:
I wouldn’t promote it as the preferred way of dealing with requests/responses because of the already mentioned issues. It’s not as flexible as using explicit callbacks (we should document that yielding a request with `callback` not being `None` breaks the chain, btw) and it’s hacky; debugging it is kind of hard. Still, it’s a really good helper for its primary use case of downloading some additional page and handling errors (as opposed to using errbacks or downloading the page with another library), so I’d also like to include it in Scrapy.
NOTE: I didn’t mean that it breaks the chain as in raising an exception, just that it won’t wait for yielded requests if they have callbacks.
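In code, the distinction described here looks roughly like this (a sketch, not taken from the thread):

```python
from scrapy import Spider, Request
from inline_requests import inline_requests

class MixedSpider(Spider):
    name = 'mixed'
    start_urls = ['https://example.com']

    @inline_requests
    def parse(self, response):
        # Waited on: no explicit callback, so the decorator resumes this
        # generator with the response once it is downloaded.
        detail = yield Request(response.urljoin('/detail'))
        yield {'title': detail.css('title::text').get()}

        # NOT waited on: an explicit callback takes the request out of the
        # inline chain; it is scheduled normally and handled on its own.
        yield Request(response.urljoin('/extra'), callback=self.parse_extra)

    def parse_extra(self, response):
        yield {'extra_url': response.url}
```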
From Pablo:
I think we’re pretty much in agreement that it would be a nice feature for 1.0 (well, if we’re not gonna have Python 3… 😃). It needs to go with good documentation (explaining the downsides), tests and better error checking (raising an exception if it’s used with a request having a callback).
Shall we make a ticket for this? I think there’s already enough content in this thread for one 😃.
/cc @darkrho
Top GitHub Comments
It also may be useful to consider whether we need anything else in addition to a linear `await Request()`. In my experience, inline-requests is mostly used in the following scenarios:
- Getting some additional data for an item; this can’t be done via `meta` cleanly, so a request, or several sequential ones, are made in the same callback. This is straightforward.
- Waiting for several requests in parallel (see the sketch after this comment). This may be covered by `asyncio.gather`/`asyncio.as_completed` etc., or it may require some special Scrapy support. See also #2600.

While we didn’t discuss the current implementation ideas much, we already found multiple questions:
Some of these problems can be answered by sending the request directly to the downloader instead of scheduling it, though this leads to some other questions.
Also, some of these problems may affect inline-requests too, but the list of inline-requests limitations is quite long and we want to have something better as a replacement.
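To make the two usage scenarios from the comment above concrete, here is a self-contained asyncio sketch; `fetch` is a hypothetical stand-in for whatever awaitable Scrapy would return for a request (awaiting a `Request` object directly is not an existing Scrapy API):

```python
import asyncio

async def fetch(url):
    # Hypothetical stand-in for a real download; pretend network I/O.
    await asyncio.sleep(0.1)
    return f"<response for {url}>"

async def callback():
    # Linear: one request at a time, the `await Request()` scenario.
    detail = await fetch("https://example.com/detail")

    # Parallel: several requests at once, the asyncio.gather scenario.
    page1, page2 = await asyncio.gather(
        fetch("https://example.com/p1"),
        fetch("https://example.com/p2"),
    )
    return [detail, page1, page2]

print(asyncio.run(callback()))
```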