Integrate darkrho/scrapy-inline-requests
I think we haven’t created a ticket for this yet; we discussed integrating https://github.com/darkrho/scrapy-inline-requests a couple of weeks ago with @pablohoffman, @kmike and @nramirezuy.
I’m copying the whole discussion here if anyone else wants to join in:
From Pablo:
I’ve heard many people using (and speaking good things about) inline requests recently. Would you consider it a feature to include in 1.0?
From Mikhail:
I’d like to have something like inline-requests builtin; it even was one of the draft ideas for 2014 GSoC 😃 +1 to add it to Scrapy 1.0 if we have time for that.
Past me wrote the following in one of the emails:
It needs a bit more thought before becoming a part of Scrapy: there are no tests, I’m not a fan of how callbacks are handled, and the downsides of the ‘yield’ approach should be clearly documented, e.g. state inside the callback lives longer, which could lead to increased memory usage; it is also unclear how it works with on-disk queues. There are also some useful features present in other similar libraries (e.g. adisp) but missing from scrapy-inline-requests, for example waiting for several requests to be executed in parallel; the syntax could be `resp1, resp2 = yield Request(url1), Request(url2)`.
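For readers unfamiliar with the library, here is a minimal sketch of how scrapy-inline-requests is used, based on its README; the URL and selector are illustrative, not from this discussion:

```python
from scrapy import Spider, Request
from inline_requests import inline_requests

class ExampleSpider(Spider):
    name = 'example'
    start_urls = ['https://example.com']

    @inline_requests
    def parse(self, response):
        # Yielding a bare Request suspends this generator; the decorator
        # resumes it with the response once the download finishes.
        detail = yield Request(response.urljoin('/detail'))
        # `detail` stays referenced by the generator frame until parse()
        # exits -- the memory concern raised above.
        yield {'title': detail.css('title::text').get()}
```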
From Nicolás:
I like the idea behind inline requests, but not its API. It kinda doesn’t fit the callback approach, since it doesn’t work with a callback and you have to manage several requests within a callback.
I would prefer to see something like `def callback: return chain_requests(request1, request2, request3)`, with the callbacks handled normally.
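The proposed `chain_requests` helper was never implemented; the following is a hypothetical sketch of what it could look like (the name and the idea come from the comment above, the implementation is guesswork):

```python
def chain_requests(*requests):
    """Hypothetical helper: return the first request, rewired so that after
    each request's own callback finishes, the next request is scheduled."""
    def make_callback(index):
        original = requests[index].callback

        def callback(response):
            if original is not None:
                # Pass through whatever the user's callback produced.
                yield from (original(response) or ())
            if index + 1 < len(requests):
                yield requests[index + 1].replace(
                    callback=make_callback(index + 1))

        return callback

    return requests[0].replace(callback=make_callback(0))
```

A spider could then write `return chain_requests(request1, request2, request3)` in a callback, and each request would keep its own, normally handled callback.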
From Mikhail:
Nicolás: I think the point of inline-requests is to allow writing code without callbacks and handle several related requests in a single function 😃 It is a common trick to “linearize” callbacks into a generator.
Callbacks + CPython reference counting (no PyPy) provide a nice approach to resource deallocation: if a variable is not referenced from outside, then it is deallocated as soon as the callback exits, without invoking the garbage collector. With generators, if the user writes `response1 = yield …; response2 = yield …` then these responses are kept alive, possibly for a long time. Even with `response = yield …; response = yield …` the response is kept in memory longer than needed (if I’m not mistaken, until the second request finishes). One can write `del response`, but it would be nice to have some clever solution for that.
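A short sketch of the retention problem and the `del` workaround (assuming the `inline_requests` decorator; URLs are illustrative):

```python
from scrapy import Spider, Request
from inline_requests import inline_requests

class TwoPageSpider(Spider):
    name = 'two_page'
    start_urls = ['https://example.com']

    @inline_requests
    def parse(self, response):
        first = yield Request('https://example.com/page1')
        title = first.css('title::text').get()
        del first  # without this, the response (and its body) stays
                   # referenced by the suspended generator frame while
                   # the next request is downloading
        second = yield Request('https://example.com/page2')
        yield {'first_title': title,
               'second_title': second.css('title::text').get()}
```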
From Julia:
I wouldn’t promote it as the preferred way of dealing with requests/responses because of the already mentioned issues. It’s not as flexible as using explicit callbacks (we should document that yielding a request with `callback` not being `None` breaks the chain, btw) and it’s hacky; debugging it is kind of hard. Still, it’s a really good helper for its primary use case of downloading some additional page and handling errors (as opposed to using errbacks or downloading the page with another library), so I’d also like to include it in Scrapy.
NOTE: I didn’t mean that it breaks the chain as in raising an exception, just that it won’t wait for yielded requests if they have callbacks.
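In code, the distinction described here looks roughly like this (a sketch, not taken from the thread):

```python
from scrapy import Spider, Request
from inline_requests import inline_requests

class MixedSpider(Spider):
    name = 'mixed'
    start_urls = ['https://example.com']

    @inline_requests
    def parse(self, response):
        # Waited on: no explicit callback, so the decorator resumes this
        # generator with the response once it is downloaded.
        detail = yield Request(response.urljoin('/detail'))
        yield {'title': detail.css('title::text').get()}

        # NOT waited on: an explicit callback takes the request out of the
        # inline chain; it is scheduled normally and handled on its own.
        yield Request(response.urljoin('/extra'), callback=self.parse_extra)

    def parse_extra(self, response):
        yield {'extra_url': response.url}
```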
From Pablo:
I think we’re pretty much in agreement that it would be a nice feature for 1.0 (well, if we’re not gonna have Python 3… 😃). It needs to go with good documentation (explaining the downsides), tests and better error checking (raising an exception if it’s used with a request having a callback).
Shall we make a ticket for this? I think there’s already enough content in this thread for one 😃.
/cc @darkrho
Top GitHub Comments
It also may be useful to consider whether we need anything else in addition to a linear `await Request()`. In my experience, inline-requests is mostly used in the following scenarios:
- Getting some additional data for an item; this can’t be done via `meta` cleanly, so a request, or several sequential ones, are made in the same callback. This is straightforward.
- Waiting for several requests in parallel (see the sketch after this comment). This may be covered by `asyncio.gather`/`asyncio.as_completed` etc., or it may require some special Scrapy support. See also #2600.

While we didn’t discuss the current implementation ideas much, we already found multiple questions:
Some of these problems can be answered by sending the request directly to the downloader instead of scheduling it, though this leads to some other questions.
Also, some of these problems may affect inline-requests too, but the list of inline-requests limitations is quite long and we want to have something better as a replacement.
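To make the two usage scenarios from the comment above concrete, here is a self-contained asyncio sketch; `fetch` is a hypothetical stand-in for whatever awaitable Scrapy would return for a request (awaiting a `Request` object directly is not an existing Scrapy API):

```python
import asyncio

async def fetch(url):
    # Hypothetical stand-in for a real download; pretend network I/O.
    await asyncio.sleep(0.1)
    return f"<response for {url}>"

async def callback():
    # Linear: one request at a time, the `await Request()` scenario.
    detail = await fetch("https://example.com/detail")

    # Parallel: several requests at once, the asyncio.gather scenario.
    page1, page2 = await asyncio.gather(
        fetch("https://example.com/p1"),
        fetch("https://example.com/p2"),
    )
    return [detail, page1, page2]

print(asyncio.run(callback()))
```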