Scrapy does not use a non-zero exit code when a scrape fails
When invoking a Scrapy spider with e.g. scrapy crawl spidername -o output.csv and the spider fails for some reason (in our case, a timeout connecting to the HTTP server), the exit code is still zero, giving subsequent steps in a shell script no way to tell whether the scrape completed successfully.
See my example here: https://gist.github.com/iandees/74f51fefeb758b57f5e7
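For illustration, here is a minimal reproduction of that behaviour from the calling side (a sketch only, not the linked gist; the spider name is the placeholder used above):

    import subprocess

    # Run the crawl exactly as described in the report.
    result = subprocess.run(["scrapy", "crawl", "spidername", "-o", "output.csv"])

    # Even if the spider failed (e.g. the remote server timed out),
    # Scrapy exits with status 0, so this check never fires.
    if result.returncode != 0:
        print("scrape failed")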
Issue Analytics
- State: closed
- Created 8 years ago
- Comments: 5 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I ended up using your examples to override the crawl command with this:
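The snippet itself is not preserved here, but one way to override the crawl command along these lines - a sketch assuming a project named myproject that sets COMMANDS_MODULE = 'myproject.commands', not necessarily the commenter's exact code - is to run the crawl with an explicit Crawler object and set self.exitcode from its stats:

    # myproject/commands/crawl.py (hypothetical path; any module named by the
    # COMMANDS_MODULE setting works). Sketch only.
    from scrapy.commands.crawl import Command as CrawlCommand
    from scrapy.exceptions import UsageError


    class Command(CrawlCommand):
        def run(self, args, opts):
            if len(args) != 1:
                raise UsageError("exactly one spider name is required")

            # Create the crawler ourselves so its stats stay reachable after
            # the crawl has finished.
            crawler = self.crawler_process.create_crawler(args[0])
            self.crawler_process.crawl(crawler, **opts.spargs)
            self.crawler_process.start()

            stats = crawler.stats.get_stats()
            failed = (
                stats.get("finish_reason") != "finished"
                or stats.get("log_count/ERROR", 0) > 0
            )
            if failed:
                # cmdline.execute() exits with cmd.exitcode, so a non-zero
                # value here propagates to the shell.
                self.exitcode = 1

With that module listed in COMMANDS_MODULE, scrapy crawl spidername -o output.csv behaves as before but exits non-zero when the stats indicate the run did not finish cleanly.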
So far this serves my needs, so I’m going to close this.
For me the existing behaviour makes sense. Errors with remote resources are inevitable - you can’t control remote servers. Say you crawled 100 pages and 1 of them failed - should that be reported in the exit code? What if you’re extracting links from a webpage and some of them are broken - is that an error in your script? Even if all requests failed, that may still be the expected result.
What logic do you propose for setting a status code?
An allowed percentage of errors is task-specific; I think hardcoding some value (“if 10% of requests failed, return a non-zero status code”) is a bad idea. To check whether a crawl was successful or not (according to what you consider successful), one can check the spider stats.
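For completeness, a sketch of that suggestion - the spider name, the stat keys used for the decision, and the 10% threshold are illustrative choices, not something the issue prescribes - running the crawl from a small script and turning the stats into an exit code:

    import sys

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Task-specific: decide how many failed downloads are acceptable before
    # the whole run counts as a failure.
    MAX_ERROR_RATIO = 0.10

    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler("spidername")
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes

    stats = crawler.stats.get_stats()
    requests = stats.get("downloader/request_count", 0)
    failures = stats.get("downloader/exception_count", 0)  # e.g. timeouts
    # No requests at all also counts as a failure here.
    ratio = failures / requests if requests else 1.0

    ok = stats.get("finish_reason") == "finished" and ratio <= MAX_ERROR_RATIO
    sys.exit(0 if ok else 1)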