Dump stats to log periodically, not only at the end of the crawl
It is useful to check Scrapy stats as a spider runs, but there is no built-in way to do that. What do you think about adding a DUMP_STATS_INTERVAL option and writing the current stats to the log every DUMP_STATS_INTERVAL seconds?
Another related proposal (sorry for putting them both into one ticket) is to add more logging to the Downloader and periodically log the number of pages the Downloader is currently trying to fetch (a MONITOR_DOWNLOADS_INTERVAL option). Checking that helps you understand what the crawler is doing: whether it is busy downloading data or not.
Implementation draft:
import logging
import pprint

from twisted.internet.task import LoopingCall

from scrapy import signals

logger = logging.getLogger(__name__)


class _LoopingExtension:
    def setup_looping_task(self, task, crawler, interval):
        self._interval = interval
        self._task = LoopingCall(task)
        crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self):
        self._task.start(self._interval, now=False)

    def spider_closed(self):
        if self._task.running:
            self._task.stop()


class MonitorDownloadsExtension(_LoopingExtension):
    """
    Enable this extension to periodically log the number of active downloads.
    """

    def __init__(self, crawler, interval):
        self.crawler = crawler
        self.setup_looping_task(self.monitor, crawler, interval)

    @classmethod
    def from_crawler(cls, crawler):
        # fixme: 0 should mean NotConfigured
        interval = crawler.settings.getfloat("MONITOR_DOWNLOADS_INTERVAL", 10.0)
        return cls(crawler, interval)

    def monitor(self):
        active_downloads = len(self.crawler.engine.downloader.active)
        logger.info("Active downloads: %s", active_downloads)


class DumpStatsExtension(_LoopingExtension):
    """
    Enable this extension to log Scrapy stats periodically, not only
    at the end of the crawl.
    """

    def __init__(self, crawler, interval):
        self.stats = crawler.stats
        self.setup_looping_task(self.print_stats, crawler, interval)

    def print_stats(self):
        stats = self.stats.get_stats()
        logger.info("Scrapy stats:\n%s", pprint.pformat(stats))

    @classmethod
    def from_crawler(cls, crawler):
        interval = crawler.settings.getfloat("DUMP_STATS_INTERVAL", 60.0)
        # fixme: 0 should mean NotConfigured
        return cls(crawler, interval)
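As the fixme comments note, an interval of 0 should probably disable an extension rather than start a zero-delay loop. A minimal sketch of how from_crawler could handle that with scrapy.exceptions.NotConfigured (just a sketch, not part of the draft above):

from scrapy.exceptions import NotConfigured

class DumpStatsExtension(_LoopingExtension):
    # __init__ and print_stats as in the draft above

    @classmethod
    def from_crawler(cls, crawler):
        interval = crawler.settings.getfloat("DUMP_STATS_INTERVAL", 60.0)
        if interval <= 0:
            # Treat 0 (or a negative value) as "extension disabled".
            raise NotConfigured
        return cls(crawler, interval)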
To get a feel for how it works, copy the draft code above into a project file (e.g. myproject/extensions.py), then add the extensions to EXTENSIONS in settings.py:
EXTENSIONS = {
    'myproject.extensions.MonitorDownloadsExtension': 100,
    'myproject.extensions.DumpStatsExtension': 101,
}
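The intervals can also be tuned in settings.py, for example (the values here are arbitrary):

MONITOR_DOWNLOADS_INTERVAL = 30.0  # log active downloads every 30 seconds
DUMP_STATS_INTERVAL = 120.0  # dump all stats every 2 minutes

With that in place, the log should periodically contain entries along these lines (illustrative; the actual stats keys depend on the crawl):

Active downloads: 8
Scrapy stats:
{'downloader/request_count': 1024,
 'downloader/response_count': 1016,
 'item_scraped_count': 842}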
Top GitHub Comments
I’m a bit late to the party here, but I got this done by using a custom middleware. I hope this helps!
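The comment above does not include the code, but one way to get a similar effect is a downloader middleware that dumps stats every N responses. A minimal sketch (the class and setting names here are hypothetical):

import logging
import pprint

logger = logging.getLogger(__name__)


class PeriodicStatsMiddleware:
    """Downloader middleware that logs crawl stats every N responses."""

    def __init__(self, stats, every):
        self.stats = stats
        self.every = every
        self.responses_seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        # STATS_EVERY_N_RESPONSES is a hypothetical setting name.
        every = crawler.settings.getint("STATS_EVERY_N_RESPONSES", 100)
        return cls(crawler.stats, every)

    def process_response(self, request, response, spider):
        self.responses_seen += 1
        if self.responses_seen % self.every == 0:
            logger.info("Scrapy stats:\n%s", pprint.pformat(self.stats.get_stats()))
        return response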
@lavoie005 Please use Stack Overflow to ask questions rather than existing, unrelated tickets. See Getting Help.