
Access Response object from the Scrapy Pipeline

See original GitHub issue

Hello,

I’d like to implement a pipeline which compares a checksum of the current item against old values in a cache and drops items that have not been modified. This looks like a standard use case for change-detection systems (e.g. price monitoring).

Right now I put the cache (old checksums) into the Request.meta field, but there is no way to access it inside the pipeline.
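
For context, a minimal sketch of the pipeline I have in mind (ChecksumDedupPipeline and its in-memory cache are illustrative stand-ins, and it assumes dict-like items). Note that process_item receives only the item and the spider, so anything stored in Request.meta is out of reach:

import hashlib

from scrapy.exceptions import DropItem


class ChecksumDedupPipeline:
    """Drop items whose checksum matches a previously seen value."""

    def __init__(self):
        # Hypothetical in-memory stand-in for the real (external) cache.
        self.seen_checksums = set()

    def process_item(self, item, spider):
        # Only `item` and `spider` arrive here -- there is no `response`,
        # and therefore no way to reach `response.request.meta`.
        checksum = hashlib.sha1(
            repr(sorted(item.items())).encode("utf-8")
        ).hexdigest()
        if checksum in self.seen_checksums:
            raise DropItem("Item not modified: %s" % checksum)
        self.seen_checksums.add(checksum)
        return item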

As a workaround I did the same trick with a spider middleware (process_spider_output), but now I have another problem: when I want to skip processing of some item and signal that to other components with an exception (like DropItem), catching it with the spider_error signal, I can’t suppress the default exception handler.

So every time I get tracebacks in the log, which I don’t want, because I have my own spider exception handler.
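
For reference, the spider-middleware workaround looks roughly like this (a sketch: the checksum() helper and the old_checksums meta key are my own names; response.meta is Scrapy’s shortcut for response.request.meta):

import hashlib

from scrapy import Request


def checksum(item):
    # Hypothetical helper: stable checksum over a dict-like item.
    return hashlib.sha1(repr(sorted(item.items())).encode("utf-8")).hexdigest()


class DedupSpiderMiddleware:
    """Drop unmodified items before they ever reach the item pipelines."""

    def process_spider_output(self, response, result, spider):
        # Unlike item pipelines, spider middleware does receive the
        # response, so a cache stored in Request.meta is reachable here.
        old_checksums = response.meta.get("old_checksums", set())
        for element in result:
            if isinstance(element, Request) or checksum(element) not in old_checksums:
                yield element
            # Unchanged items are silently filtered out; raising an
            # exception here instead is what triggers the unwanted
            # traceback logging described above.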

This is because of the following code in scrapy/core/scraper.py:

def handle_spider_error(self, _failure, request, response, spider):
    exc = _failure.value
    if isinstance(exc, CloseSpider):
        self.crawler.engine.close_spider(spider, exc.reason or 'cancelled')
        return
    referer = request.headers.get('Referer')
    logger.error(
        "Spider error processing %(request)s (referer: %(referer)s)",
        {'request': request, 'referer': referer},
        exc_info=failure_to_exc_info(_failure),
        extra={'spider': spider}
    )
    self.signals.send_catch_log(
        signal=signals.spider_error,
        failure=_failure, response=response,
        spider=spider
    )
    ...other code here...

It seems there is no way to suppress spider exceptions before the signal is fired (even when I have my own handler).

I know that I can implement a custom crawler signal (or use the existing item_dropped signal, as sketched below) and fire it from my middleware as a workaround, but wouldn’t it be easier to access the Response object from the pipeline handlers?
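
For completeness, the item_dropped workaround would look roughly like this (a sketch with a hypothetical extension name; the signal’s documented arguments do include the response):

from scrapy import signals


class DroppedItemMonitor:
    """Hypothetical extension reacting to dropped items."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # item_dropped is sent as (item, response, exception, spider),
        # so the response -- and response.request.meta -- is available
        # in the handler even though the pipeline itself never saw it.
        crawler.signals.connect(ext.item_dropped, signal=signals.item_dropped)
        return ext

    def item_dropped(self, item, response, exception, spider):
        spider.logger.info("Dropped item from %s: %s", response.url, exception)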

I can try to make a patch that adds the additional parameter to the pipeline handler (without breaking compatibility with old code), but first I’d like to discuss it with the community.

Is it conceptually wrong to access the response object from the pipeline?

P.S. Another question, outside the topic: what if we want to suppress spider exceptions that differ from the CloseSpider exception?

Issue Analytics

  • State: open
  • Created: 7 years ago
  • Reactions: 2
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
jacob1237 commented, Dec 14, 2016

Well, it looks like that’s not true. At least, there is no direct way to produce such a “memory leak”. Here is the code from scrapy/core/scraper.py:

def _process_spidermw_output(self, output, request, response, spider):
    """Process each Request/Item (given in the output parameter) returned
    from the given spider
    """
    if isinstance(output, Request):
        self.crawler.engine.crawl(request=output, spider=spider)
    elif isinstance(output, (BaseItem, dict)):
        self.slot.itemproc_size += 1
        dfd = self.itemproc.process_item(output, spider)
        dfd.addBoth(self._itemproc_finished, output, response, spider)
        return dfd
    <other code...>

You can see that self.itemproc.process_item (a Deferred) is invoked without the response, but there is also a callback assigned to that deferred, self._itemproc_finished, which Twisted invokes only AFTER pipeline processing. And yes, it receives the response object as a parameter.

Also, this method (self._itemproc_finished) fires the item_scraped signal. This means the request lifecycle is longer than your message suggests. So, for example, I could produce the same “memory leak” from any extension that listens for the item_scraped signal (see the sketch below).

The only disadvantage I see is that if I add the response object to self.itemproc.process_item, it will be inconsistent with similar function signatures: the response will have to be passed at the end of the argument list in order to maintain compatibility with old code.
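
To illustrate the lifecycle point above, here is a minimal sketch of such an extension (the class name is hypothetical; item_scraped is documented to be sent with the item, the response, and the spider):

from scrapy import signals


class ItemScrapedProbe:
    """Hypothetical extension showing the response is still alive
    after all pipelines have finished."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def item_scraped(self, item, response, spider):
        # Fired from _itemproc_finished, i.e. *after* pipeline processing;
        # the response (and its request, with meta) is handed to us here,
        # so it was never eligible for garbage collection in between.
        spider.logger.debug("Response %s outlived the pipelines", response.url)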

1 reaction
IAlwaysBeCoding commented, Dec 14, 2016

If you have the response passed to every Item Pipeline, then you will have a very long-lived memory leak. Remember that every Response has an associated Request attached on its request attribute. Those responses would have to be kept alive, not discarded, in order for them to still exist when you pass them through the pipeline. At the moment, as soon as they are done passing through the spider middlewares, it takes only a little while before the garbage collector comes and picks them up.

Holding on to Response objects just so you can pass them to an Item Pipeline is a really bad design decision.

Read more comments on GitHub >

Top Results From Across the Web

Requests and Responses — Scrapy 2.7.1 documentation
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the ...

Access response from spider in items pipeline in scrapy
I think what you need is a middleware, not a pipeline. A middleware can access requests and responses, read this: doc.scrapy.org/en/latest/ ...

Scrapy - Requests and Responses - Tutorialspoint
Scrapy can crawl websites using the Request and Response objects. The request objects pass over the system, uses the spiders to execute the...

Scrapy - Item Pipeline - GeeksforGeeks
Spider object which is opened and a reference to self object are the parameters. (These are default cases of python language). Returns...

Requests and Responses - Scrapy documentation
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system...
