
Scrapy does not use a non-zero exit code when a scrape fails

See original GitHub issue

When invoking a Scrapy spider with e.g. scrapy crawl spidername -o output.csv and the spider fails for some reason (in our case, timeout to the HTTP server), the exit code is zero, giving subsequent steps in a shell script no way to check if the scrape completed successfully.

See my example here: https://gist.github.com/iandees/74f51fefeb758b57f5e7
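
To illustrate the shell-script situation (a sketch, not from the original report; the spider name just follows the example invocation above), the check on the exit code never fires even when the crawl errors out:

import subprocess

# Run the spider exactly as the shell script would.
exit_code = subprocess.call(["scrapy", "crawl", "spidername", "-o", "output.csv"])

# Scrapy returns 0 even if the spider failed, so this branch is never taken.
if exit_code != 0:
    print("crawl failed, skipping the remaining steps")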

Issue Analytics

  • State: closed
  • Created: 8 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
iandees commented, May 15, 2015

I ended up using your examples to override the crawl command with this:

# Subclass the built-in "crawl" command so the exit code can be set from the
# crawl stats (this uses the Scrapy API as it existed when this was written).
from scrapy.commands.crawl import Command as ExistingCrawlCommand
from scrapy.exceptions import UsageError


class Command(ExistingCrawlCommand):
    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        # Same setup as the stock crawl command: build the crawler and run it.
        crawler = self.crawler_process.create_crawler()
        spider = crawler.spiders.create(spname, **opts.spargs)
        crawler.crawl(spider)
        self.crawler_process.start()

        # If any downloader exceptions were recorded, report failure to the shell.
        exception_count = crawler.stats.get_value('downloader/exception_count')
        if exception_count:
            self.exitcode = 1

So far this serves my needs, so I’m going to close this.
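
For anyone adapting the override above: Scrapy picks up custom commands through the COMMANDS_MODULE setting, and a command module with the same name as a built-in one replaces it. A rough sketch of the wiring (the project and module names here are made up):

# myproject/commands/__init__.py  -> empty file, makes "commands" a package
# myproject/commands/crawl.py     -> the Command class from the comment above

# myproject/settings.py
COMMANDS_MODULE = "myproject.commands"

With that in place, scrapy crawl spidername behaves as before but sets a non-zero exit code when downloader exceptions were recorded, so a shell script can test $? as usual.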

1 reaction
kmike commented, May 14, 2015

For me the existing behaviour makes sense. Errors with remote resources are inevitable - you can’t control remote servers. Say you crawled 100 pages and 1 of them failed - should that be reported in the exit code? What if you’re extracting links from a webpage and some of them are broken - is that an error in your script? Even if all requests failed, that may still be the expected result.

What logic do you propose for setting a status code?

The allowed percentage of errors is task-specific; I think hardcoding some value ("if 10% of requests failed, return a non-zero status code") is a bad idea. To check whether a crawl was successful (according to whatever you consider successful), one can check the spider stats.
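
As a concrete illustration of that last point (a sketch only; the crawler API has shifted between Scrapy versions, and MySpider plus the 1% threshold are made-up placeholders), one can run the crawl programmatically, read the stats afterwards, and pick the exit code according to whatever the task considers a success:

import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.example import MySpider  # hypothetical spider

process = CrawlerProcess(get_project_settings())
crawler = process.create_crawler(MySpider)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

stats = crawler.stats.get_stats()
failed = stats.get("downloader/exception_count", 0)
total = stats.get("downloader/request_count", 0) or 1

# Treat the crawl as successful if fewer than 1% of requests raised exceptions.
sys.exit(0 if failed / float(total) < 0.01 else 1)

This keeps the threshold in your own script rather than in Scrapy, which is the point kmike makes above.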


