Scrapy chokes on HTTP response status lines without a Reason phrase
Try to fetch the page:
$ scrapy fetch 'http://www.gidroprofmontag.ru/bassein/sbornue_basseynu'
Output:
2013-07-11 09:15:37+0400 [scrapy] INFO: Scrapy 0.17.0-304-g3fe2a32 started (bot: amon)
/home/tonal/amon/amon/amon/downloadermiddleware/blocked.py:6: ScrapyDeprecationWarning: Module `scrapy.stats` is deprecated, use `crawler.stats` attribute instead
from scrapy.stats import stats
2013-07-11 09:15:37+0400 [amon_ra] INFO: Spider opened
2013-07-11 09:15:37+0400 [amon_ra] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-11 09:15:37+0400 [amon_ra] ERROR: Error downloading <GET http://www.gidroprofmontag.ru/bassein/sbornue_basseynu>: [<twisted.python.failure.Failure <class 'scrapy.xlib.tx._newclient.ParseError'>>]
2013-07-11 09:15:37+0400 [amon_ra] INFO: Closing spider (finished)
2013-07-11 09:15:37+0400 [amon_ra] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.xlib.tx._newclient.ResponseFailed': 1,
'downloader/request_bytes': 256,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 11, 5, 15, 37, 512010),
'log_count/ERROR': 1,
'log_count/INFO': 4,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2013, 7, 11, 5, 15, 37, 257898)}
2013-07-11 09:15:37+0400 [amon_ra] INFO: Spider closed (finished)
Issue Analytics
- Created 10 years ago
- Comments: 37 (24 by maintainers)
Top GitHub Comments
By default, Scrapy filters out unsuccessful responses such as 404 before they reach your spider. This behavior is defined in the HttpError spider middleware.
To receive them, add HTTPERROR_ALLOW_ALL = True to your settings file.
After this you can access response.status in your parse function.
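A minimal sketch of the settings change described above, assuming a standard project layout where settings live in settings.py. The commented-out HTTPERROR_ALLOWED_CODES line is an alternative added here for illustration, not something the comment mentions. It narrows the exception to specific status codes.

```python
# settings.py (assumed location of the project's Scrapy settings)

# Pass every response to the spider, whatever its HTTP status code.
HTTPERROR_ALLOW_ALL = True

# Narrower alternative (illustrative): let only selected codes through.
# HTTPERROR_ALLOWED_CODES = [404]
```

Note that this only affects responses that were downloaded successfully. It would not be expected to help with the ParseError in the log above, which is raised before any response object exists.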
Twisted has a patch ready to fix this issue: https://twistedmatrix.com/trac/ticket/7673#comment:5 PR https://github.com/twisted/twisted/pull/723 🎉
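To illustrate the failure mode named in the title, here is a simplified sketch of strict versus tolerant status-line parsing. It is not Twisted's actual code. The function names are hypothetical, and it assumes the server sends a status line such as `HTTP/1.1 200` with no reason phrase.

```python
def parse_status_strict(line: bytes):
    """Require version, status code and reason phrase; reject anything else."""
    parts = line.split(b" ", 2)
    if len(parts) != 3:
        raise ValueError("wrong number of parts: %r" % (line,))
    version, code, reason = parts
    return version, int(code), reason


def parse_status_tolerant(line: bytes):
    """Accept a missing reason phrase by treating it as empty."""
    parts = line.split(b" ", 2)
    if len(parts) < 2:
        raise ValueError("wrong number of parts: %r" % (line,))
    version, code = parts[0], parts[1]
    reason = parts[2] if len(parts) == 3 else b""
    return version, int(code), reason


if __name__ == "__main__":
    print(parse_status_tolerant(b"HTTP/1.1 200"))
    try:
        parse_status_strict(b"HTTP/1.1 200")
    except ValueError as exc:
        print("strict parser failed:", exc)
```

The strict variant rejects a line that lacks the third field, which matches the kind of parse failure reported in the log. The tolerant variant keeps the version and status code and substitutes an empty reason phrase.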