Growing download latency of a CPU-heavy spider

The issue manifests itself as growing download latency when the spider is relatively CPU-intensive and sends a lot of requests. Here is an example Python 3 spider, based on the spider used by scrapy bench:

from urllib.parse import urlencode
import time

import scrapy
from scrapy.linkextractors import LinkExtractor
import logging


class _BenchSpider(scrapy.Spider):
    """A spider that follows all links"""
    name = 'follow'
    total = 10000
    show = 20
    baseurl = 'http://localhost:8998'
    link_extractor = LinkExtractor()
    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'LOGSTATS_INTERVAL': 1,

        'CONCURRENT_REQUESTS': 32,  # changed
        'CONCURRENT_REQUESTS_PER_DOMAIN': 32,  # changed
    }

    def start_requests(self):
        self.t0 = time.time()
        qargs = {'total': self.total, 'show': self.show}
        url = '{}?{}'.format(self.baseurl, urlencode(qargs, doseq=1))
        return [scrapy.Request(url, dont_filter=True)]

    def parse(self, response):
        # add latency reporting
        logging.info('latency {:.2f} s after {:.0f} s'.format(
            response.meta['download_latency'], time.time() - self.t0))
        for link in self.link_extractor.extract_links(response):
            for i in range(5):  # added CPU work extracting items
                time.sleep(0.05)
                yield {'step': i}
            yield scrapy.Request(link.url, callback=self.parse)

Run the server in one window: python -m scrapy.utils.benchserver

Run the spider in another window: scrapy runspider cpu_spider.py

Observe the output:

2018-04-02 19:29:50 [scrapy.core.engine] INFO: Spider opened
2018-04-02 19:29:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-02 19:29:50 [root] INFO: latency 0.01 s after 0 s
2018-04-02 19:29:51 [root] INFO: latency 0.11 s after 1 s
2018-04-02 19:29:51 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 180 pages/min), scraped 17 items (at 1020 items/min)
...
2018-04-02 19:30:02 [root] INFO: latency 0.94 s after 12 s
2018-04-02 19:30:02 [root] INFO: latency 0.94 s after 12 s
2018-04-02 19:30:03 [scrapy.extensions.logstats] INFO: Crawled 34 pages (at 360 pages/min), scraped 239 items (at 2340 items/min)
...
2018-04-02 19:30:57 [root] INFO: latency 8.41 s after 67 s
2018-04-02 19:31:02 [scrapy.extensions.logstats] INFO: Crawled 101 pages (at 480 pages/min), scraped 1318 items (at 5940 items/min)
...
2018-04-02 19:31:56 [root] INFO: latency 12.77 s after 126 s
2018-04-02 19:32:01 [scrapy.extensions.logstats] INFO: Crawled 141 pages (at 480 pages/min), scraped 2410 items (at 8280 items/min)
...
2018-04-02 19:34:36 [root] INFO: latency 20.37 s after 286 s
2018-04-02 19:34:47 [scrapy.extensions.logstats] INFO: Crawled 213 pages (at 480 pages/min), scraped 5470 items (at 12840 items/min)
...
2018-04-02 19:37:02 [root] INFO: latency 25.34 s after 431 s
2018-04-02 19:37:08 [scrapy.extensions.logstats] INFO: Crawled 261 pages (at 480 pages/min), scraped 8103 items (at 16260 items/min)
...
2018-04-02 19:47:57 [root] INFO: latency 40.99 s after 1087 s
2018-04-02 19:48:11 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 480 pages/min), scraped 20446 items (at 23820 items/min)
2018-04-02 19:48:32 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 0 pages/min), scraped 20835 items (at 23340 items/min)

So it seems that download latency (response.meta['download_latency']) keeps growing for as long as the crawl runs (roughly linearly at first), reaching about 41 s after 18 minutes, while the actual time the server spends on generating the response is very low.
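As a quick check on the shape of that growth, the latency samples from the log excerpt can be turned into growth rates between consecutive samples. This is only a rough sketch: the tuples are copied by hand from the log lines above, the elapsed times are rounded to whole seconds by the spider's log format, and growth_rates is a hypothetical helper name.

```python
# (elapsed seconds, download latency in seconds), copied from the log excerpt
samples = [
    (0, 0.01), (1, 0.11), (12, 0.94), (67, 8.41),
    (126, 12.77), (286, 20.37), (431, 25.34), (1087, 40.99),
]


def growth_rates(points):
    """Return latency growth (seconds of latency per second of crawl)
    between each pair of consecutive samples."""
    return [
        (l1 - l0) / (t1 - t0)
        for (t0, l0), (t1, l1) in zip(points, points[1:])
    ]


rates = growth_rates(samples)
for (elapsed, _), rate in zip(samples[1:], rates):
    print('up to {:>4d} s: {:.3f} s of latency per second'.format(elapsed, rate))
```

On these samples the rate stays positive throughout, so the latency never levels off, although the later intervals show a smaller rate than the early ones. With so few points this is only a rough reading of the curve.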

To summarize, I expected download latency to remain reasonable and not to grow, since the spider is taking constant time to process items.
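One way a busy spider could inflate this number even with a fast server: if latency is timed inside a single-threaded event loop, any synchronous work that occupies the loop is counted too. The following is a minimal sketch of that effect using asyncio as a stand-in. Scrapy itself runs on Twisted, and this is not its actual measurement code; timed_fetch, measure and run are hypothetical names.

```python
import asyncio
import time


async def timed_fetch(server_delay):
    """Stand-in for a download: the 'server' replies after server_delay seconds.

    The latency is measured from inside the event loop, so it also includes
    any time during which the loop could not resume this coroutine.
    """
    start = time.monotonic()
    await asyncio.sleep(server_delay)
    return time.monotonic() - start


async def measure(server_delay, blocking_work):
    """Start one fetch, then block the loop for blocking_work seconds."""
    fetch = asyncio.ensure_future(timed_fetch(server_delay))
    await asyncio.sleep(0)      # yield once so the fetch actually starts
    time.sleep(blocking_work)   # synchronous work: nothing else can run
    return await fetch


def run(server_delay, blocking_work):
    """Run a single measurement in a fresh event loop."""
    return asyncio.run(measure(server_delay, blocking_work))


if __name__ == '__main__':
    print('idle loop:    {:.2f} s'.format(run(0.01, 0.0)))
    print('blocked loop: {:.2f} s'.format(run(0.01, 0.3)))
```

In both runs the simulated server answers after 0.01 s, but in the second run the reported latency is at least as long as the blocking work, because the coroutine cannot observe the reply until the loop is free again.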

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 5
  • Comments: 16 (10 by maintainers)

Top GitHub Comments

lopuhin commented, Apr 6, 2018 (3 reactions)

I think an explanation of the problem and suggestions in the docs will already help a lot. But we also don’t have a good way for the user to diagnose the problem: ideally we should issue a warning when the I/O loop is blocked for a long time.
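The kind of warning described here could be approximated with a heartbeat: schedule a callback at a fixed interval and see how late it fires. Below is a minimal, loop-agnostic sketch under that assumption. LoopLagMonitor, its parameters and the warning text are all hypothetical; nothing like this is confirmed to exist in Scrapy.

```python
import logging

logger = logging.getLogger(__name__)


class LoopLagMonitor:
    """Hypothetical helper: detect a blocked event loop from heartbeat timing.

    Call tick(now) from a callback that is scheduled to run every `interval`
    seconds. If the loop was blocked, the callback fires late, and the delay
    beyond `interval` is reported as lag.
    """

    def __init__(self, interval=1.0, threshold=0.5):
        self.interval = interval    # expected seconds between ticks
        self.threshold = threshold  # lag in seconds that triggers a warning
        self._last = None           # timestamp of the previous tick

    def tick(self, now):
        """Record a heartbeat at time `now` and return the lag in seconds."""
        lag = 0.0
        if self._last is not None:
            lag = max(0.0, (now - self._last) - self.interval)
            if lag > self.threshold:
                logger.warning('event loop blocked for about %.2f s', lag)
        self._last = now
        return lag
```

In a Twisted-based process the heartbeat could, for example, be driven by twisted.internet.task.LoopingCall calling monitor.tick(time.monotonic()) every monitor.interval seconds; that wiring is an untested suggestion, not something taken from the issue.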

lopuhin commented, Apr 3, 2018 (1 reaction)

Yes, I think this fits the topic of the performance GSoC well, even if the solution only involves better monitoring and documentation.
