
Dump stats to log periodically, not only at the end of the crawl

See original GitHub issue

It is useful to check Scrapy stats as the spider runs, but there is no built-in way to do that. What do you think about adding a DUMP_STATS_INTERVAL option and writing the current stats to the log every DUMP_STATS_INTERVAL seconds?

Another related proposal (sorry for putting them all into this ticket) is to add more logging to the Downloader and periodically log the number of pages the Downloader is currently trying to fetch (a MONITOR_DOWNLOADS_INTERVAL option). Checking that helps to understand what the crawler is doing - whether it is busy downloading data or not.

Implementation draft:

import logging
import pprint

from twisted.internet.task import LoopingCall
from scrapy import signals

logger = logging.getLogger(__name__)


class _LoopingExtension:
    """Runs a task repeatedly (every `interval` seconds) while the spider is open."""

    def setup_looping_task(self, task, crawler, interval):
        self._interval = interval
        self._task = LoopingCall(task)
        crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self):
        self._task.start(self._interval, now=False)

    def spider_closed(self):
        if self._task.running:
            self._task.stop()


class MonitorDownloadsExtension(_LoopingExtension):
    """
    Enable this extension to periodically log the number of active downloads.
    """
    def __init__(self, crawler, interval):
        self.crawler = crawler
        self.setup_looping_task(self.monitor, crawler, interval)

    @classmethod
    def from_crawler(cls, crawler):
        # fixme: 0 should mean NotConfigured
        interval = crawler.settings.getfloat("MONITOR_DOWNLOADS_INTERVAL", 10.0)
        return cls(crawler, interval)

    def monitor(self):
        active_downloads = len(self.crawler.engine.downloader.active)
        logger.info("Active downloads: {}".format(active_downloads))


class DumpStatsExtension(_LoopingExtension):
    """
    Enable this extension to log Scrapy stats periodically, not only
    at the end of the crawl.
    """
    def __init__(self, crawler, interval):
        self.stats = crawler.stats
        self.setup_looping_task(self.print_stats, crawler, interval)

    def print_stats(self):
        stats = self.stats.get_stats()
        logger.info("Scrapy stats:\n" + pprint.pformat(stats))

    @classmethod
    def from_crawler(cls, crawler):
        interval = crawler.settings.getfloat("DUMP_STATS_INTERVAL", 60.0)
        # fixme: 0 should mean NotConfigured
        return cls(crawler, interval)

To get a feel for how this works, copy the code above into a project module (e.g. myproject/extensions.py) and then enable both extensions in EXTENSIONS in settings.py:

EXTENSIONS = {
    'myproject.extensions.MonitorDownloadsExtension': 100,
    'myproject.extensions.DumpStatsExtension': 101,
}
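
A side note on the "fixme" comments in the draft: one way to resolve them would be to treat an interval of 0 as "extension disabled". A minimal sketch of how from_crawler might look under that interpretation (reusing _LoopingExtension from the draft; __init__ and print_stats stay unchanged):

from scrapy.exceptions import NotConfigured


class DumpStatsExtension(_LoopingExtension):
    # __init__ and print_stats unchanged from the draft above.

    @classmethod
    def from_crawler(cls, crawler):
        interval = crawler.settings.getfloat("DUMP_STATS_INTERVAL", 60.0)
        if not interval:
            # DUMP_STATS_INTERVAL = 0 means "do not enable this extension".
            raise NotConfigured
        return cls(crawler, interval)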

Issue Analytics

  • State: open
  • Created 7 years ago
  • Reactions: 17
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

2 reactions
lorenzfischer commented, Nov 25, 2020

I’m a bit late to the party here, but I got this done by using a custom middleware. I hope this helps!

import datetime
import logging
import pprint


class StatsPrinter(object):

    def __init__(self, user_agent=''):
        self._last_print = datetime.datetime.now()
        self._reporting_interval_secs = 60

    def process_request(self, request, spider):
        if (datetime.datetime.now() - self._last_print).total_seconds() > self._reporting_interval_secs:
            # downloader/request_bytes and downloader/response_bytes are cumulative byte counts,
            # so MB = bytes / 1024 / 1024.
            total_mb_out = spider.crawler.stats.get_value('downloader/request_bytes', 0) / 1024 / 1024
            total_mb_in = spider.crawler.stats.get_value('downloader/response_bytes', 0) / 1024 / 1024
            logging.info("Data transferred: {:,.2f} MB (in: {:,.2f} MB, out: {:,.2f} MB)"
                         .format(total_mb_in + total_mb_out, total_mb_in, total_mb_out))
            logging.debug("Dumping Scrapy stats:\n" + pprint.pformat(spider.crawler.stats.get_stats()),
                          extra={'spider': spider})
            self._last_print = datetime.datetime.now()

        return None  # returning None lets the request continue through the middleware chain unchanged
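
For this to run, the middleware also has to be enabled in settings.py. A minimal sketch, assuming the class lives in myproject/middlewares.py (the module path and the priority value 543 are only placeholders for your project):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.StatsPrinter': 543,
}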

0 reactions
Gallaecio commented, May 4, 2022

@lavoie005 Please, use StackOverflow to ask questions, not existing, unrelated tickets. See Getting Help.

Read more comments on GitHub >
