
Access Response object from the Scrapy Pipeline

See original GitHub issue

Hello,

I’d like to implement a pipeline which compares a checksum of the current item against old values in a cache and drops items that have not been modified. This looks like a standard use case for change-detection systems (e.g. price monitoring).

Right now I put the cache (old checksums) into the Request.meta field, but there is no way to access it inside the pipeline.
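
For context, a minimal sketch of the pipeline I have in mind (ChecksumDedupPipeline and its in-memory cache are illustrative stand-ins, and it assumes dict-like items). Note that process_item receives only the item and the spider, so anything stored in Request.meta is out of reach:

import hashlib

from scrapy.exceptions import DropItem


class ChecksumDedupPipeline:
    """Drop items whose checksum matches a previously seen value."""

    def __init__(self):
        # Hypothetical in-memory stand-in for the real (external) cache.
        self.seen_checksums = set()

    def process_item(self, item, spider):
        # Only `item` and `spider` arrive here -- there is no `response`,
        # and therefore no way to reach `response.request.meta`.
        checksum = hashlib.sha1(
            repr(sorted(item.items())).encode("utf-8")
        ).hexdigest()
        if checksum in self.seen_checksums:
            raise DropItem("Item not modified: %s" % checksum)
        self.seen_checksums.add(checksum)
        return item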

As a workaround I did the same trick with a spider middleware (process_spider_output), but now I have another problem: when I want to skip processing of some item and signal that to other components with an exception (like DropItem), catching it with the spider_error signal, I can’t suppress the default exception handler.

So every time I get tracebacks in the log, which I don’t want, because I have my own spider exception handler.
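
For reference, the spider-middleware workaround looks roughly like this (a sketch: the checksum() helper and the old_checksums meta key are my own names; response.meta is Scrapy’s shortcut for response.request.meta):

import hashlib

from scrapy import Request


def checksum(item):
    # Hypothetical helper: stable checksum over a dict-like item.
    return hashlib.sha1(repr(sorted(item.items())).encode("utf-8")).hexdigest()


class DedupSpiderMiddleware:
    """Drop unmodified items before they ever reach the item pipelines."""

    def process_spider_output(self, response, result, spider):
        # Unlike item pipelines, spider middleware does receive the
        # response, so a cache stored in Request.meta is reachable here.
        old_checksums = response.meta.get("old_checksums", set())
        for element in result:
            if isinstance(element, Request) or checksum(element) not in old_checksums:
                yield element
            # Unchanged items are silently filtered out; raising an
            # exception here instead is what triggers the unwanted
            # traceback logging described above.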

This is because of the following code in scrapy/core/scraper.py:

def handle_spider_error(self, _failure, request, response, spider):
    exc = _failure.value
    if isinstance(exc, CloseSpider):
        self.crawler.engine.close_spider(spider, exc.reason or 'cancelled')
        return
    referer = request.headers.get('Referer')
    logger.error(
        "Spider error processing %(request)s (referer: %(referer)s)",
        {'request': request, 'referer': referer},
        exc_info=failure_to_exc_info(_failure),
        extra={'spider': spider}
    )
    self.signals.send_catch_log(
        signal=signals.spider_error,
        failure=_failure, response=response,
        spider=spider
    )
    ...other code here...

It seems there is no way to suppress spider exceptions before the signal is fired (even when I have my own handler).

I know that I can implement a custom crawler signal (or use the existing item_dropped signal, as sketched below) and fire it from my middleware as a workaround, but wouldn’t it be easier to access the Response object from the pipeline handlers?
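
For completeness, the item_dropped workaround would look roughly like this (a sketch with a hypothetical extension name; the signal’s documented arguments do include the response):

from scrapy import signals


class DroppedItemMonitor:
    """Hypothetical extension reacting to dropped items."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # item_dropped is sent as (item, response, exception, spider),
        # so the response -- and response.request.meta -- is available
        # in the handler even though the pipeline itself never saw it.
        crawler.signals.connect(ext.item_dropped, signal=signals.item_dropped)
        return ext

    def item_dropped(self, item, response, exception, spider):
        spider.logger.info("Dropped item from %s: %s", response.url, exception)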

I can try to make a patch that adds the additional parameter to the pipeline handler (without breaking compatibility with old code), but first I’d like to discuss it with the community.

Is it conceptually wrong to access the response object from the pipeline?

P.S. Another question, outside the topic: what if we want to suppress spider exceptions that differ from the CloseSpider exception?

Issue Analytics

  • State: open
  • Created: 7 years ago
  • Reactions: 2
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
jacob1237 commented, Dec 14, 2016

Well, it looks like that’s not true. At least, there is no direct way to produce such a “memory leak”. Here is the code from scrapy/core/scraper.py:

def _process_spidermw_output(self, output, request, response, spider):
    """Process each Request/Item (given in the output parameter) returned
    from the given spider
    """
    if isinstance(output, Request):
        self.crawler.engine.crawl(request=output, spider=spider)
    elif isinstance(output, (BaseItem, dict)):
        self.slot.itemproc_size += 1
        dfd = self.itemproc.process_item(output, spider)
        dfd.addBoth(self._itemproc_finished, output, response, spider)
        return dfd
    <other code...>

You can see that self.itemproc.process_item (a Deferred) is invoked without the response, but there is also a callback assigned to that deferred, self._itemproc_finished, which Twisted invokes only AFTER pipeline processing. And yes, it receives the response object as a parameter.

Also, this method (self._itemproc_finished) fires the item_scraped signal. This means the request lifecycle is longer than your message suggests. So, for example, I could produce the same “memory leak” from any extension that listens for the item_scraped signal (see the sketch below).

The only disadvantage I see is that if I add the response object to self.itemproc.process_item, it will be inconsistent with similar function signatures: the response will have to be passed at the end of the argument list in order to maintain compatibility with old code.
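
To illustrate the lifecycle point above, here is a minimal sketch of such an extension (the class name is hypothetical; item_scraped is documented to be sent with the item, the response, and the spider):

from scrapy import signals


class ItemScrapedProbe:
    """Hypothetical extension showing the response is still alive
    after all pipelines have finished."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def item_scraped(self, item, response, spider):
        # Fired from _itemproc_finished, i.e. *after* pipeline processing;
        # the response (and its request, with meta) is handed to us here,
        # so it was never eligible for garbage collection in between.
        spider.logger.debug("Response %s outlived the pipelines", response.url)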

1 reaction
IAlwaysBeCoding commented, Dec 14, 2016

If you have the response passed to every Item Pipeline, then you will have a very long-lived memory leak. Remember that every Response has an associated Request attached on its request attribute. Those responses would have to be kept alive, not discarded, in order for them to still exist when you pass them through the pipeline. At the moment, as soon as they are done passing through the spider middlewares, it takes only a little while before the garbage collector comes and picks them up.

Holding on to Response objects just so you can pass them to an Item Pipeline is a really bad design decision.

Read more comments on GitHub >

Top Results From Across the Web

Requests and Responses — Scrapy 2.7.1 documentation
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the ...

Access response from spider in items pipeline in scrapy
I think what you need is a middleware, not a pipeline. A middleware can access requests and responses, read this: doc.scrapy.org/en/latest/ ...

Scrapy - Requests and Responses - Tutorialspoint
Scrapy can crawl websites using the Request and Response objects. The request objects pass over the system, uses the spiders to execute the...

Scrapy - Item Pipeline - GeeksforGeeks
Spider object which is opened and a reference to self object are the parameters. (These are default cases of python language). Returns...

Requests and Responses - Scrapy documentation
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system...
