Scrapy does not use a non-zero exit code when a scrape fails
When invoking a Scrapy spider with e.g. scrapy crawl spidername -o output.csv and the spider fails for some reason (in our case, a timeout connecting to the HTTP server), the exit code is still zero, giving subsequent steps in a shell script no way to tell whether the scrape completed successfully.
See my example here: https://gist.github.com/iandees/74f51fefeb758b57f5e7
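For illustration, here is a minimal reproduction of that behaviour from the calling side (a sketch only, not the linked gist; the spider name is the placeholder used above):

    import subprocess

    # Run the crawl exactly as described in the report.
    result = subprocess.run(["scrapy", "crawl", "spidername", "-o", "output.csv"])

    # Even if the spider failed (e.g. the remote server timed out),
    # Scrapy exits with status 0, so this check never fires.
    if result.returncode != 0:
        print("scrape failed")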
Issue Analytics
- State: closed
- Created 8 years ago
- Comments: 5 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I ended up using your examples to override the crawl command with this:
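The snippet itself is not preserved here, but one way to override the crawl command along these lines - a sketch assuming a project named myproject that sets COMMANDS_MODULE = 'myproject.commands', not necessarily the commenter's exact code - is to run the crawl with an explicit Crawler object and set self.exitcode from its stats:

    # myproject/commands/crawl.py (hypothetical path; any module named by the
    # COMMANDS_MODULE setting works). Sketch only.
    from scrapy.commands.crawl import Command as CrawlCommand
    from scrapy.exceptions import UsageError


    class Command(CrawlCommand):
        def run(self, args, opts):
            if len(args) != 1:
                raise UsageError("exactly one spider name is required")

            # Create the crawler ourselves so its stats stay reachable after
            # the crawl has finished.
            crawler = self.crawler_process.create_crawler(args[0])
            self.crawler_process.crawl(crawler, **opts.spargs)
            self.crawler_process.start()

            stats = crawler.stats.get_stats()
            failed = (
                stats.get("finish_reason") != "finished"
                or stats.get("log_count/ERROR", 0) > 0
            )
            if failed:
                # cmdline.execute() exits with cmd.exitcode, so a non-zero
                # value here propagates to the shell.
                self.exitcode = 1

With that module listed in COMMANDS_MODULE, scrapy crawl spidername -o output.csv behaves as before but exits non-zero when the stats indicate the run did not finish cleanly.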
So far this serves my needs, so I’m going to close this.
For me the existing behaviour makes sense. Errors with remote resources are inevitable - you can’t control remote servers. Say you crawled 100 pages and 1 of them failed - should that be reported in the exit code? What if you’re extracting links from a webpage and some of them are broken - is that an error in your script? Even if all requests failed, that may still be the expected result.
What logic do you propose for setting a status code?
An allowed percentage of errors is task-specific; I think hardcoding some value (“if 10% of requests failed, return a non-zero status code”) is a bad idea. To check whether a crawl was successful or not (according to what you consider successful), one can check the spider stats.
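For completeness, a sketch of that suggestion - the spider name, the stat keys used for the decision, and the 10% threshold are illustrative choices, not something the issue prescribes - running the crawl from a small script and turning the stats into an exit code:

    import sys

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Task-specific: decide how many failed downloads are acceptable before
    # the whole run counts as a failure.
    MAX_ERROR_RATIO = 0.10

    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler("spidername")
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes

    stats = crawler.stats.get_stats()
    requests = stats.get("downloader/request_count", 0)
    failures = stats.get("downloader/exception_count", 0)  # e.g. timeouts
    # No requests at all also counts as a failure here.
    ratio = failures / requests if requests else 1.0

    ok = stats.get("finish_reason") == "finished" and ratio <= MAX_ERROR_RATIO
    sys.exit(0 if ok else 1)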