Errors occur when the crawler is closed
Description
The following error occurs when the crawler is closed: 'NoneType' object has no attribute 'start_requests'. (The Sogou anti-bot message that appears in the log below, 我们的系统检测到您网络中存在异常访问请求, translates to "our system has detected abnormal access requests from your network".)

Log:
{'finish_reason': 'response msg error 我们的系统检测到您网络中存在异常访问请求, url '
'https://weixin.sogou.com/weixin?type=2&s_from=input&query=mjl_tfsteel&ie=utf8&_sug_=y&_sug_type_=!',
'finish_time': datetime.datetime(2019, 8, 30, 9, 22, 3, 628463),
'memusage/max': 679272448,
'memusage/startup': 679272448,
'start_time': datetime.datetime(2019, 8, 30, 9, 22, 2, 54762)}
2019-08-30 17:22:03,628 [scrapy.core.engine] INFO: Spider closed (response msg error 我们的系统检测到您网络中存在异常访问请求, url https://weixin.sogou.com/weixin?type=2&s_from=input&query=mjl_tfsteel&ie=utf8&_sug_=y&_sug_type_=!)
2019-08-30 17:22:04,016 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
File "/home/user/.local/lib/python3.6/site-packages/scrapy/commands/crawl.py", line 58, in run
self.crawler_process.start()
File "/home/user/.local/lib/python3.6/site-packages/scrapy/crawler.py", line 293, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/home/user/.local/lib/python3.6/site-packages/twisted/internet/base.py", line 1272, in run
self.mainLoop()
File "/home/user/.local/lib/python3.6/site-packages/twisted/internet/base.py", line 1281, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "/home/user/.local/lib/python3.6/site-packages/twisted/internet/base.py", line 902, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/home/user/.local/lib/python3.6/site-packages/scrapy/utils/reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "/home/user/.local/lib/python3.6/site-packages/scrapy/core/engine.py", line 137, in _next_request
if self.spider_is_idle(spider) and slot.close_if_idle:
File "/home/user/.local/lib/python3.6/site-packages/scrapy/core/engine.py", line 189, in spider_is_idle
if self.slot.start_requests is not None:
builtins.AttributeError: 'NoneType' object has no attribute 'start_requests'
Code:
# Imports used by this excerpt:
import random
import requests
from urllib import parse
from scrapy.utils.project import get_project_settings

def start_requests(self):
    for wechat_config in self.wechat_list:
        wechat_name = wechat_config.name
        variety = wechat_config.variety
        search_allow_rule = wechat_config.search_allow_rule
        wechat_id = wechat_config.wechat_id
        wx_id = wechat_config.wx_id
        url = "https://weixin.sogou.com/weixin?type=2&s_from=input&query={}&ie=utf8&_sug_=y&_sug_type_=".format(parse.quote(wechat_id))
        self.headers = {
            "Host": 'weixin.sogou.com',
            "Upgrade-Insecure-Requests": '1',
            "Accept": 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
            "Referer": 'https://weixin.sogou.com/',
            "Accept-Encoding": 'gzip, deflate, br',
            "Accept-Language": 'en-US,en;q=0.9'
        }
        self.headers['User-Agent'] = random.choice(get_project_settings().get('MC_USER_AGENT'))
        # Blocking HTTP call made outside Scrapy's download machinery.
        response = requests.get(url, headers=self.headers, cookies={}, timeout=300)
        # Note: find() returns 0 when the message starts the body;
        # '!= -1' would be the robust check.
        if str(response.content.decode('utf-8')).find(self.error_msg) > 0:
            # Closing the spider from inside start_requests tears down the
            # engine slot while a scheduled _next_request call is still
            # pending, which produces the AttributeError above.
            self.crawler.engine.close_spider(self, 'response msg error {}, url {}!'.format(self.error_msg, url))
            return
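For reference, here is a minimal sketch of the approach the maintainers point to below: let Scrapy schedule the request and raise CloseSpider from the callback, instead of calling requests.get() and engine.close_spider() inside start_requests. The spider class name and the parse_search callback are hypothetical; wechat_list, error_msg and MC_USER_AGENT are taken from the excerpt above.

import random
from urllib import parse

import scrapy
from scrapy.exceptions import CloseSpider

class WechatSpider(scrapy.Spider):  # hypothetical name for this sketch
    name = 'wechat'
    # wechat_list and error_msg are assumed to be set as in the excerpt above

    def start_requests(self):
        for wechat_config in self.wechat_list:
            url = ('https://weixin.sogou.com/weixin?type=2&s_from=input'
                   '&query={}&ie=utf8&_sug_=y&_sug_type_='.format(
                       parse.quote(wechat_config.wechat_id)))
            headers = {'Host': 'weixin.sogou.com',
                       'Referer': 'https://weixin.sogou.com/'}
            headers['User-Agent'] = random.choice(
                self.settings.get('MC_USER_AGENT'))
            # Let Scrapy's downloader fetch the page instead of requests.get().
            yield scrapy.Request(url, headers=headers, callback=self.parse_search)

    def parse_search(self, response):
        if self.error_msg in response.text:
            # Raising CloseSpider from a request callback is the documented
            # way to shut the spider down cleanly.
            raise CloseSpider('response msg error {}, url {}'.format(
                self.error_msg, response.url))
        # ... continue parsing search results here ...

With this shape the engine owns the request lifecycle, so the slot is never torn down while a _next_request call is still scheduled.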
Versions
Scrapy (1.6.0)
Top GitHub Comments
Maybe we can look into improving error handling here, so the root issue is more obvious from the error message.
@Luokun2016 Please note that the docs for CloseSpider say it should be raised from a request callback. Also, the exception is not actually being raised, just declared, which has no effect on the method:
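The code quoted at this point in the comment is not preserved in this copy. As a minimal sketch of the pattern being described (the callback name and message are illustrative):

from scrapy.exceptions import CloseSpider

def parse(self, response):
    if self.error_msg in response.text:
        CloseSpider('response msg error')        # no effect: the exception object is created and discarded
        raise CloseSpider('response msg error')  # correct: raise it from the request callback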