Scrapy chokes on HTTP response status lines without a Reason phrase

Try fetching this page:

$ scrapy fetch 'http://www.gidroprofmontag.ru/bassein/sbornue_basseynu'

Output:

2013-07-11 09:15:37+0400 [scrapy] INFO: Scrapy 0.17.0-304-g3fe2a32 started (bot: amon)
/home/tonal/amon/amon/amon/downloadermiddleware/blocked.py:6: ScrapyDeprecationWarning: Module `scrapy.stats` is deprecated, use `crawler.stats` attribute instead
  from scrapy.stats import stats
2013-07-11 09:15:37+0400 [amon_ra] INFO: Spider opened
2013-07-11 09:15:37+0400 [amon_ra] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-11 09:15:37+0400 [amon_ra] ERROR: Error downloading <GET http://www.gidroprofmontag.ru/bassein/sbornue_basseynu>: [<twisted.python.failure.Failure <class 'scrapy.xlib.tx._newclient.ParseError'>>]
2013-07-11 09:15:37+0400 [amon_ra] INFO: Closing spider (finished)
2013-07-11 09:15:37+0400 [amon_ra] INFO: Dumping Scrapy stats:
        {'downloader/exception_count': 1,
         'downloader/exception_type_count/scrapy.xlib.tx._newclient.ResponseFailed': 1,
         'downloader/request_bytes': 256,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 7, 11, 5, 15, 37, 512010),
         'log_count/ERROR': 1,
         'log_count/INFO': 4,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2013, 7, 11, 5, 15, 37, 257898)}
2013-07-11 09:15:37+0400 [amon_ra] INFO: Spider closed (finished)

Issue Analytics

Created: 10 years ago
Comments: 37 (24 by maintainers)

Top GitHub Comments

2 reactions

mohanbe commented, Dec 4, 2018

By default, Scrapy filters out 404 responses (and other non-2xx responses); this is done by the HttpError middleware.

So add HTTPERROR_ALLOW_ALL = True to your settings file.

After that, you can access response.status in your parse function.
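
The suggestion above could look roughly like the sketch below. The class name AllowAllStatusMixin is hypothetical, and the mixin deliberately does not import Scrapy so that it runs on its own; in a real project you would combine it with scrapy.Spider, or simply put HTTPERROR_ALLOW_ALL = True in settings.py. This setting only affects responses that were downloaded and parsed successfully, so it is not expected to change how a status line without a reason phrase is handled.

```python
class AllowAllStatusMixin:
    """Hypothetical mixin, meant to be combined with scrapy.Spider, e.g.
    class MySpider(AllowAllStatusMixin, scrapy.Spider): ...
    """

    # Per-spider equivalent of putting HTTPERROR_ALLOW_ALL = True in settings.py:
    # non-2xx responses are passed on to the spider's callbacks.
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    def parse(self, response):
        # With the setting above, this callback also receives 404s, 500s, etc.
        # Only the standard `url` and `status` attributes of the response are read.
        yield {"url": response.url, "status": response.status}
```

Inside parse, response.status can then be used to branch on error pages instead of having them dropped silently.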

2 reactions

rmax commented, Feb 28, 2017

Twisted has a patch ready to fix this issue: https://twistedmatrix.com/trac/ticket/7673#comment:5 (PR: https://github.com/twisted/twisted/pull/723) 🎉
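
To illustrate what the issue title refers to, here is a minimal sketch of a status-line parser that tolerates a missing reason phrase, for example "HTTP/1.1 200" instead of "HTTP/1.1 200 OK". It is not the code from the Twisted patch or from Scrapy; the function name parse_status_line is made up for this example.

```python
def parse_status_line(line: bytes):
    """Split an HTTP status line into (version, status_code, reason).

    The reason phrase is treated as optional, so both
    b"HTTP/1.1 200 OK" and b"HTTP/1.1 200" are accepted.
    """
    # Drop the trailing CRLF, then split into at most three parts so that
    # multi-word reason phrases such as "Not Found" stay in one piece.
    parts = line.rstrip(b"\r\n").split(b" ", 2)
    if len(parts) < 2 or not parts[0].startswith(b"HTTP/"):
        raise ValueError(f"malformed status line: {line!r}")

    version = parts[0].decode("ascii")
    status_code = int(parts[1].decode("ascii"))
    # A strict parser would fail here when the third part is absent.
    reason = parts[2].decode("latin-1") if len(parts) == 3 else ""
    return version, status_code, reason
```

A parser that insists on exactly three space-separated parts rejects the second form, which is consistent with the ParseError in the log above.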
