Growing download latency of a CPU-heavy spider

The issue manifests itself as growing latency when the spider is relatively CPU-intensive and sends a lot of requests. Here is an example Python 3 spider, based on the scrapy bench spider:

import logging
import time
from urllib.parse import urlencode

import scrapy
from scrapy.linkextractors import LinkExtractor


class _BenchSpider(scrapy.Spider):
    """A spider that follows all links"""
    name = 'follow'
    total = 10000
    show = 20
    baseurl = 'http://localhost:8998'
    link_extractor = LinkExtractor()
    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'LOGSTATS_INTERVAL': 1,

        'CONCURRENT_REQUESTS': 32,  # changed
        'CONCURRENT_REQUESTS_PER_DOMAIN': 32,  # changed
    }

    def start_requests(self):
        self.t0 = time.time()
        qargs = {'total': self.total, 'show': self.show}
        url = '{}?{}'.format(self.baseurl, urlencode(qargs, doseq=1))
        return [scrapy.Request(url, dont_filter=True)]

    def parse(self, response):
        # add latency reporting
        logging.info('latency {:.2f} s after {:.0f} s'.format(
            response.meta['download_latency'], time.time() - self.t0))
        for link in self.link_extractor.extract_links(response):
            for i in range(5):  # added: simulate CPU work while extracting items
                time.sleep(0.05)
                yield {'step': i}
            yield scrapy.Request(link.url, callback=self.parse)

Run the server in one window:

python -m scrapy.utils.benchserver

Run the spider in another:

scrapy runspider cpu_spider.py

Observe the output:

2018-04-02 19:29:50 [scrapy.core.engine] INFO: Spider opened
2018-04-02 19:29:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-02 19:29:50 [root] INFO: latency 0.01 s after 0 s
2018-04-02 19:29:51 [root] INFO: latency 0.11 s after 1 s
2018-04-02 19:29:51 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 180 pages/min), scraped 17 items (at 1020 items/min)
...
2018-04-02 19:30:02 [root] INFO: latency 0.94 s after 12 s
2018-04-02 19:30:02 [root] INFO: latency 0.94 s after 12 s
2018-04-02 19:30:03 [scrapy.extensions.logstats] INFO: Crawled 34 pages (at 360 pages/min), scraped 239 items (at 2340 items/min)
...
2018-04-02 19:30:57 [root] INFO: latency 8.41 s after 67 s
2018-04-02 19:31:02 [scrapy.extensions.logstats] INFO: Crawled 101 pages (at 480 pages/min), scraped 1318 items (at 5940 items/min)
...
2018-04-02 19:31:56 [root] INFO: latency 12.77 s after 126 s
2018-04-02 19:32:01 [scrapy.extensions.logstats] INFO: Crawled 141 pages (at 480 pages/min), scraped 2410 items (at 8280 items/min)
...
2018-04-02 19:34:36 [root] INFO: latency 20.37 s after 286 s
2018-04-02 19:34:47 [scrapy.extensions.logstats] INFO: Crawled 213 pages (at 480 pages/min), scraped 5470 items (at 12840 items/min)
...
2018-04-02 19:37:02 [root] INFO: latency 25.34 s after 431 s
2018-04-02 19:37:08 [scrapy.extensions.logstats] INFO: Crawled 261 pages (at 480 pages/min), scraped 8103 items (at 16260 items/min)
...
2018-04-02 19:47:57 [root] INFO: latency 40.99 s after 1087 s
2018-04-02 19:48:11 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 480 pages/min), scraped 20446 items (at 23820 items/min)
2018-04-02 19:48:32 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 0 pages/min), scraped 20835 items (at 23340 items/min)

So it seems that download latency (response.meta['download_latency']) is growing linearly, while the actual time the server spends on generating the response is very low.
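One way to check the trend is to pull the latency samples out of the log and see how fast they rise. The sketch below hard-codes the six distinct `latency ... after ...` samples from the later part of the log above; the regular expression and the helper name `latency_growth` are illustrative and not part of Scrapy. It computes the growth rate between consecutive samples:

```python
import re

# The "latency" lines quoted in the log excerpt above.
LOG = """\
2018-04-02 19:30:02 [root] INFO: latency 0.94 s after 12 s
2018-04-02 19:30:57 [root] INFO: latency 8.41 s after 67 s
2018-04-02 19:31:56 [root] INFO: latency 12.77 s after 126 s
2018-04-02 19:34:36 [root] INFO: latency 20.37 s after 286 s
2018-04-02 19:37:02 [root] INFO: latency 25.34 s after 431 s
2018-04-02 19:47:57 [root] INFO: latency 40.99 s after 1087 s
"""

PATTERN = re.compile(r"latency ([\d.]+) s after (\d+) s")


def latency_growth(log_text):
    """Return (samples, rates).

    samples are (elapsed_seconds, latency_seconds) pairs; rates are the
    extra seconds of latency gained per second of crawl time between
    consecutive samples.
    """
    samples = [(int(t), float(lat)) for lat, t in PATTERN.findall(log_text)]
    rates = [
        (lat2 - lat1) / (t2 - t1)
        for (t1, lat1), (t2, lat2) in zip(samples, samples[1:])
    ]
    return samples, rates


samples, rates = latency_growth(LOG)
for (elapsed, latency), rate in zip(samples[1:], rates):
    print(f"after {elapsed:4d} s: latency {latency:5.2f} s, growth {rate:.3f} s/s")
```

On these six samples the latency rises at every step, while the growth rate falls from roughly 0.14 s/s to roughly 0.02 s/s, so the growth is steady but slows down over the run.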

To summarize, I expected download latency to remain reasonable and not to grow, since the spider is taking constant time to process items.
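A likely explanation, in line with the blocked I/O loop mentioned in the comments below, is that the latency is measured on the same thread that runs the callbacks, so time spent in a blocking callback is counted as download time. The sketch below illustrates the effect with asyncio as a stand-in for Scrapy's Twisted reactor. That substitution is an assumption made to keep the example self-contained, and `fake_download` and `observed_latency` are hypothetical names:

```python
import asyncio
import time


async def fake_download(server_delay):
    """Pretend to fetch a page; return the latency the event loop observes."""
    start = time.monotonic()
    await asyncio.sleep(server_delay)  # the "server" replies after server_delay
    return time.monotonic() - start


async def observed_latency(block_for):
    """Start a 10 ms download, then block the loop for block_for seconds."""
    download = asyncio.ensure_future(fake_download(0.01))
    await asyncio.sleep(0)  # yield once so the download records its start time
    if block_for:
        time.sleep(block_for)  # blocking call, like time.sleep() in parse()
    return await download


if __name__ == "__main__":
    for block_for in (0.0, 0.2):
        latency = asyncio.run(observed_latency(block_for))
        print(f"blocked {block_for:.2f} s -> observed latency {latency:.3f} s")
```

The simulated server always answers in 10 ms, yet the latency observed in the blocked case is at least as long as the blocking call, because the loop cannot notice the finished download until it regains control.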

Issue Analytics

Created 5 years ago
Reactions: 5
Comments: 16 (10 by maintainers)

Top GitHub Comments

3 reactions

lopuhin commented, Apr 6, 2018

I think an explanation of the problem and suggestions in the docs will already help a lot. But we also don’t have a good way for the user to diagnose the problem. Ideally, we should issue a warning when the I/O loop is blocked for a long time.
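Such a warning could take the form of a heartbeat that measures how late the loop wakes it up. The following is only a sketch of that idea, written with asyncio rather than Twisted to keep it self-contained. `watch_loop_lag`, its parameters, and the logger name are hypothetical and not an existing Scrapy feature:

```python
import asyncio
import logging
import time

logger = logging.getLogger("loop_monitor")  # hypothetical logger name


async def watch_loop_lag(interval=0.05, threshold=0.1, lags=None):
    """Warn whenever the loop wakes this task more than `threshold` s late."""
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        lag = time.monotonic() - start - interval
        if lag > threshold:
            logger.warning("event loop was blocked for about %.2f s", lag)
            if lags is not None:
                lags.append(lag)  # keep the measurement for inspection


async def demo():
    """Run the monitor while one blocking call stalls the loop."""
    lags = []
    monitor = asyncio.ensure_future(watch_loop_lag(lags=lags))
    await asyncio.sleep(0.1)  # loop is responsive during this wait
    time.sleep(0.3)           # blocking call: nothing else can run
    await asyncio.sleep(0.1)  # give the monitor a chance to notice
    monitor.cancel()
    try:
        await monitor
    except asyncio.CancelledError:
        pass
    return lags


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print("recorded lags:", asyncio.run(demo()))
```

The heartbeat cannot run while the loop is blocked, so the warning appears only after the blocking call returns. That is still enough to tell the user that callbacks are starving the I/O loop.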

1 reaction

lopuhin commented, Apr 3, 2018

Yes, I think this fits the performance GSoC topic well, even if the solution only involves better monitoring and documentation.