
Scrapy spider fails to terminate after finishing web scrape

See original GitHub issue

I am running a spider with Scrapy, but after it finishes crawling it can’t seem to terminate. The log stats just keep reporting that it is crawling 0 pages per minute. When I try to quit with Ctrl-C, it fails to shut down gracefully and I have to force-quit with a second Ctrl-C. Any clue what is happening?

After completing a scrape, I just get output like this:

2017-08-24 11:13:45 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 0 pages/min), scraped 54 items (at 0 items/min)
2017-08-24 11:14:45 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 0 pages/min), scraped 54 items (at 0 items/min)
2017-08-24 11:15:45 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 0 pages/min), scraped 54 items (at 0 items/min)
2017-08-24 11:16:45 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 0 pages/min), scraped 54 items (at 0 items/min)
2017-08-24 11:17:45 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 0 pages/min), scraped 54 items (at 0 items/min)
2017-08-24 11:18:45 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 0 pages/min), scraped 54 items (at 0 items/min)
2017-08-24 11:19:45 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 0 pages/min), scraped 54 items (at 0 items/min)
2017-08-24 11:20:45 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 0 pages/min), scraped 54 items (at 0 items/min)
2017-08-24 11:21:45 [scrapy.extensions.logstats] INFO: Crawled 60 pages (at 0 pages/min), scraped 54 items (at 0 items/min)

which continues indefinitely.

My spider starts on a page that contains a list of links spread over multiple pages. It visits the first page, extracts the links (using the request.meta trick to pass some information along with each followed link), and then moves on to the next page of links.

A second callback, parse_transcript, extracts information from the individual pages.

I don’t see any error messages, and the job completes successfully; it just fails to end. This is a problem because I would like to use a script to run the job multiple times on different pages (same structure, different information), but since the first job never finishes I can never get to the next set of pages to scrape.
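For context, a driver script that runs the same spider over several start pages typically looks something like the sketch below. This is illustrative only, not the actual project code: the URLs are placeholders, and the start_url argument assumes the spider accepts it in __init__ or start_requests.

# Hypothetical driver script: schedule one crawl per start page and run them
# all in a single process. URLs and the start_url argument are placeholders.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

for url in [
    'http://example.com/transcripts?page=1',
    'http://example.com/interviews?page=1',
]:
    process.crawl('transcripts', start_url=url)  # the spider is looked up by name

process.start()  # blocks until every scheduled crawl has finished

Of course, as long as the first crawl never finishes, process.start() never returns either, which is exactly the problem described here.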

The parse(self, response) method yields two types of requests (a consolidated sketch of the full method follows this list):

  1. For each link on the page, a request to visit that page and extract more information:

     request = scrapy.Request(item['url'], callback=self.parse_transcript)
     request.meta['item'] = item
     yield request

  2. If there is another page of links, a request for the next page, built by incrementing the page number in the URL with a regex:

     while data['count'] > 0:
         next_page = re.sub(r'(?<=page=)(\d+)', lambda x: str(int(x.group(0)) + 1), response.url)
         yield Request(next_page)
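Putting the two pieces together, a minimal sketch of the spider described above might look like the following. This is a reconstruction, not the actual project code: the start URL, the way data is obtained, the extract_items() helper, and the body of parse_transcript are assumptions made only to keep the example self-contained.

import json
import re

import scrapy


class TranscriptsSpider(scrapy.Spider):
    name = 'transcripts'
    start_urls = ['http://example.com/transcripts?page=1']  # placeholder

    def parse(self, response):
        # Assumption: the remaining-results count comes from a JSON payload.
        data = json.loads(response.text)

        # 1. Visit each link on the current page, carrying the partial item along.
        for item in self.extract_items(response):  # hypothetical helper returning dicts with a 'url' key
            request = scrapy.Request(item['url'], callback=self.parse_transcript)
            request.meta['item'] = item
            yield request

        # 2. Follow the next page of links, as in the snippet above.
        #    Note that data['count'] has to change somewhere for this loop to finish.
        while data['count'] > 0:
            next_page = re.sub(r'(?<=page=)(\d+)',
                               lambda x: str(int(x.group(0)) + 1),
                               response.url)
            yield scrapy.Request(next_page)

    def parse_transcript(self, response):
        item = response.meta['item']  # the partial item passed along from parse()
        # ... fill in the remaining fields from this page ...
        yield item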

I checked the engine status using the telnet extension. I’m not sure how to interpret this information though.

>>> est()
Execution engine status

time()-engine.start_time                        : 10746.1215799
engine.has_capacity()                           : False
len(engine.downloader.active)                   : 0
engine.scraper.is_idle()                        : False
engine.spider.name                              : transcripts
engine.spider_is_idle(engine.spider)            : False
engine.slot.closing                             : <Deferred at 0x10d8fda28>
len(engine.slot.inprogress)                     : 4
len(engine.slot.scheduler.dqs or [])            : 0
len(engine.slot.scheduler.mqs)                  : 0
len(engine.scraper.slot.queue)                  : 0
len(engine.scraper.slot.active)                 : 4
engine.scraper.slot.active_size                 : 31569
engine.scraper.slot.itemproc_size               : 0
engine.scraper.slot.needs_backout()             : False
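Roughly speaking, those numbers say that nothing is being downloaded (len(engine.downloader.active) is 0) and nothing is queued (both scheduler queues are empty), but four responses are still sitting in the scraper slot (len(engine.slot.inprogress) and len(engine.scraper.slot.active) are both 4). As long as those responses count as in progress, engine.spider_is_idle() stays False and the crawl cannot finish; the fact that engine.slot.closing already holds a Deferred suggests a close has been requested and the engine is waiting for those four responses to be fully processed.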

I tried raising an exception to close the spider after it reached the end of the links, but this prematurely stopped the spider before it was able to visit all of the links that had been scraped. Furthermore, the engine still appeared to hang after the spider was closed.

while data['count'] > 0:
    next_page = re.sub(r'(?<=page=)(\d+)', lambda x: str(int(x.group(0)) + 1), response.url)
    yield Request(next_page)
else:
    raise CloseSpider('End of transcript history has been reached.')
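One thing worth noting about that construct, independent of the hang itself: Python’s while ... else only runs the else block once the loop condition becomes false, so as long as data['count'] stays positive the CloseSpider line is never reached, and the callback keeps yielding the same next_page URL, which would keep the response marked as in progress in the scraper (compare the telnet output above). If the goal is simply to stop paginating when there are no more results, an explicit CloseSpider should not be needed; a sketch, keeping the names from the snippet above:

# Inside parse(): follow the next page only while more results remain.
if data['count'] > 0:
    next_page = re.sub(r'(?<=page=)(\d+)', lambda x: str(int(x.group(0)) + 1), response.url)
    yield Request(next_page)
# When data['count'] is 0, simply stop yielding; once all outstanding requests
# have finished, Scrapy closes the spider on its own.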

I also tried the CLOSESPIDER_TIMEOUT setting (handled by the built-in CloseSpider extension), but to no avail. The spider appears to close properly, but the engine keeps idling indefinitely:

2017-08-30 11:20:44 [scrapy.extensions.logstats] INFO: Crawled 48 pages (at 9 pages/min), scraped 42 items (at 9 items/min)
2017-08-30 11:23:44 [scrapy.extensions.logstats] INFO: Crawled 48 pages (at 0 pages/min), scraped 42 items (at 0 items/min)
2017-08-30 11:24:44 [scrapy.extensions.logstats] INFO: Crawled 48 pages (at 0 pages/min), scraped 42 items (at 0 items/min)
2017-08-30 11:25:44 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2017-08-30 11:25:44 [scrapy.extensions.logstats] INFO: Crawled 48 pages (at 0 pages/min), scraped 42 items (at 0 items/min)
2017-08-30 11:28:44 [scrapy.extensions.logstats] INFO: Crawled 48 pages (at 0 pages/min), scraped 42 items (at 0 items/min)
2017-08-30 11:29:44 [scrapy.extensions.logstats] INFO: Crawled 48 pages (at 0 pages/min), scraped 42 items (at 0 items/min)
2017-08-30 11:32:44 [scrapy.extensions.logstats] INFO: Crawled 48 pages (at 0 pages/min), scraped 42 items (at 0 items/min)
^C2017-08-30 11:33:31 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2017-08-30 11:41:44 [scrapy.extensions.logstats] INFO: Crawled 48 pages (at 0 pages/min), scraped 42 items (at 0 items/min)
^C2017-08-30 11:45:52 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
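For anyone trying to reproduce this, an illustrative configuration of that setting (the value is arbitrary) looks like:

# settings.py -- ask the built-in CloseSpider extension to close the spider
# if the crawl is still running after this many seconds (illustrative value)
CLOSESPIDER_TIMEOUT = 600

The same value can also be passed per run with scrapy crawl transcripts -s CLOSESPIDER_TIMEOUT=600. Either way it only asks the engine to close the spider gracefully; it does not force the process to exit, which is consistent with the behaviour in the log above.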

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
IAlwaysBeCoding commented, Sep 25, 2017

Yeah, no problem. I’m glad you got it working.

1 reaction
kmike commented, Sep 4, 2017

Could you try enabling the DEBUG log level? Often this happens when there is a non-responding website and Scrapy makes a lot of retries. It is also helpful to use the MonitorDownloads extension from https://github.com/scrapy/scrapy/issues/2173 to see how many downloads are in progress.
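For anyone following along, this just means making sure nothing raises the log level above DEBUG, either in the project settings or per run with scrapy crawl transcripts -L DEBUG; for example:

# settings.py -- keep the log level at DEBUG so that per-request messages
# such as retries and timeouts are visible in the output
LOG_LEVEL = 'DEBUG'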


