
Dump stats to log periodically, not only at the end of the crawl

See original GitHub issue

It is useful to check Scrapy stats as the spider runs, but there is no built-in way to do that. What do you think about adding a DUMP_STATS_INTERVAL option and writing the current stats to the log every DUMP_STATS_INTERVAL seconds?

Another related proposal (sorry for putting them all into this ticket) is to add more logging to the Downloader and periodically log the number of pages the Downloader is currently trying to fetch (a MONITOR_DOWNLOADS_INTERVAL option). Checking that helps to understand what the crawler is doing - whether it is busy downloading data or not.

Implementation draft:

import logging
import pprint

from twisted.internet.task import LoopingCall
from scrapy import signals

logger = logging.getLogger(__name__)


class _LoopingExtension:
    """Runs a task repeatedly (every `interval` seconds) while the spider is open."""

    def setup_looping_task(self, task, crawler, interval):
        self._interval = interval
        self._task = LoopingCall(task)
        crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self):
        self._task.start(self._interval, now=False)

    def spider_closed(self):
        if self._task.running:
            self._task.stop()


class MonitorDownloadsExtension(_LoopingExtension):
    """
    Enable this extension to periodically log the number of active downloads.
    """
    def __init__(self, crawler, interval):
        self.crawler = crawler
        self.setup_looping_task(self.monitor, crawler, interval)

    @classmethod
    def from_crawler(cls, crawler):
        # fixme: 0 should mean NotConfigured
        interval = crawler.settings.getfloat("MONITOR_DOWNLOADS_INTERVAL", 10.0)
        return cls(crawler, interval)

    def monitor(self):
        active_downloads = len(self.crawler.engine.downloader.active)
        logger.info("Active downloads: {}".format(active_downloads))


class DumpStatsExtension(_LoopingExtension):
    """
    Enable this extension to log Scrapy stats periodically, not only
    at the end of the crawl.
    """
    def __init__(self, crawler, interval):
        self.stats = crawler.stats
        self.setup_looping_task(self.print_stats, crawler, interval)

    def print_stats(self):
        stats = self.stats.get_stats()
        logger.info("Scrapy stats:\n" + pprint.pformat(stats))

    @classmethod
    def from_crawler(cls, crawler):
        interval = crawler.settings.getfloat("DUMP_STATS_INTERVAL", 60.0)
        # fixme: 0 should mean NotConfigured
        return cls(crawler, interval)

To get a feel for how this works, copy the code above into a project module (e.g. myproject/extensions.py) and then enable both extensions in EXTENSIONS in settings.py:

EXTENSIONS = {
    'myproject.extensions.MonitorDownloadsExtension': 100,
    'myproject.extensions.DumpStatsExtension': 101,
}
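
A side note on the "fixme" comments in the draft: one way to resolve them would be to treat an interval of 0 as "extension disabled". A minimal sketch of how from_crawler might look under that interpretation (reusing _LoopingExtension from the draft; __init__ and print_stats stay unchanged):

from scrapy.exceptions import NotConfigured


class DumpStatsExtension(_LoopingExtension):
    # __init__ and print_stats unchanged from the draft above.

    @classmethod
    def from_crawler(cls, crawler):
        interval = crawler.settings.getfloat("DUMP_STATS_INTERVAL", 60.0)
        if not interval:
            # DUMP_STATS_INTERVAL = 0 means "do not enable this extension".
            raise NotConfigured
        return cls(crawler, interval)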

Issue Analytics

  • State: open
  • Created 7 years ago
  • Reactions: 17
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions
lorenzfischer commented, Nov 25, 2020

I’m a bit late to the party here, but I got this done by using a custom middleware. I hope this helps!

import datetime
import logging
import pprint


class StatsPrinter(object):

    def __init__(self, user_agent=''):
        self._last_print = datetime.datetime.now()
        self._reporting_interval_secs = 60

    def process_request(self, request, spider):
        if (datetime.datetime.now() - self._last_print).total_seconds() > self._reporting_interval_secs:
            # downloader/request_bytes and downloader/response_bytes are cumulative byte counts,
            # so MB = bytes / 1024 / 1024.
            total_mb_out = spider.crawler.stats.get_value('downloader/request_bytes', 0) / 1024 / 1024
            total_mb_in = spider.crawler.stats.get_value('downloader/response_bytes', 0) / 1024 / 1024
            logging.info("Data transferred: {:,.2f} MB (in: {:,.2f} MB, out: {:,.2f} MB)"
                         .format(total_mb_in + total_mb_out, total_mb_in, total_mb_out))
            logging.debug("Dumping Scrapy stats:\n" + pprint.pformat(spider.crawler.stats.get_stats()),
                          extra={'spider': spider})
            self._last_print = datetime.datetime.now()

        return None  # returning None lets the request continue through the middleware chain unchanged
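
For this to run, the middleware also has to be enabled in settings.py. A minimal sketch, assuming the class lives in myproject/middlewares.py (the module path and the priority value 543 are only placeholders for your project):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.StatsPrinter': 543,
}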

0 reactions
Gallaecio commented, May 4, 2022

@lavoie005 Please, use StackOverflow to ask questions, not existing, unrelated tickets. See Getting Help.

Read more comments on GitHub >
