Growing download latency of a CPU-heavy spider
The issue manifests itself as growing latency when the spider is relatively CPU-intensive and sends a lot of requests. Here is an example Python 3 spider, based on the scrapy bench spider:
from urllib.parse import urlencode
import logging
import time

import scrapy
from scrapy.linkextractors import LinkExtractor


class _BenchSpider(scrapy.Spider):
    """A spider that follows all links"""
    name = 'follow'
    total = 10000
    show = 20
    baseurl = 'http://localhost:8998'
    link_extractor = LinkExtractor()
    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'LOGSTATS_INTERVAL': 1,
        'CONCURRENT_REQUESTS': 32,  # changed
        'CONCURRENT_REQUESTS_PER_DOMAIN': 32,  # changed
    }

    def start_requests(self):
        self.t0 = time.time()
        qargs = {'total': self.total, 'show': self.show}
        url = '{}?{}'.format(self.baseurl, urlencode(qargs, doseq=1))
        return [scrapy.Request(url, dont_filter=True)]

    def parse(self, response):
        # add latency reporting
        logging.info('latency {:.2f} s after {:.0f} s'.format(
            response.meta['download_latency'], time.time() - self.t0))
        for link in self.link_extractor.extract_links(response):
            for i in range(5):  # added CPU work extracting items
                # time.sleep stands in for CPU work: it blocks the callback
                time.sleep(0.05)
                yield {'step': i}
            yield scrapy.Request(link.url, callback=self.parse)
Run the server in one window: python -m scrapy.utils.benchserver
Run the spider: scrapy runspider cpu_spider.py
Observe the output:
2018-04-02 19:29:50 [scrapy.core.engine] INFO: Spider opened
2018-04-02 19:29:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-02 19:29:50 [root] INFO: latency 0.01 s after 0 s
2018-04-02 19:29:51 [root] INFO: latency 0.11 s after 1 s
2018-04-02 19:29:51 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 180 pages/min), scraped 17 items (at 1020 items/min)
...
2018-04-02 19:30:02 [root] INFO: latency 0.94 s after 12 s
2018-04-02 19:30:02 [root] INFO: latency 0.94 s after 12 s
2018-04-02 19:30:03 [scrapy.extensions.logstats] INFO: Crawled 34 pages (at 360 pages/min), scraped 239 items (at 2340 items/min)
...
2018-04-02 19:30:57 [root] INFO: latency 8.41 s after 67 s
2018-04-02 19:31:02 [scrapy.extensions.logstats] INFO: Crawled 101 pages (at 480 pages/min), scraped 1318 items (at 5940 items/min)
...
2018-04-02 19:31:56 [root] INFO: latency 12.77 s after 126 s
2018-04-02 19:32:01 [scrapy.extensions.logstats] INFO: Crawled 141 pages (at 480 pages/min), scraped 2410 items (at 8280 items/min)
...
2018-04-02 19:34:36 [root] INFO: latency 20.37 s after 286 s
2018-04-02 19:34:47 [scrapy.extensions.logstats] INFO: Crawled 213 pages (at 480 pages/min), scraped 5470 items (at 12840 items/min)
...
2018-04-02 19:37:02 [root] INFO: latency 25.34 s after 431 s
2018-04-02 19:37:08 [scrapy.extensions.logstats] INFO: Crawled 261 pages (at 480 pages/min), scraped 8103 items (at 16260 items/min)
...
2018-04-02 19:47:57 [root] INFO: latency 40.99 s after 1087 s
2018-04-02 19:48:11 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 480 pages/min), scraped 20446 items (at 23820 items/min)
2018-04-02 19:48:32 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 0 pages/min), scraped 20835 items (at 23340 items/min)
So it seems that download latency (response.meta['download_latency']) is growing linearly, while the actual time the server spends on generating the response is very low.
To summarize, I expected download latency to remain reasonable and not to grow, since the spider is taking constant time to process items.
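One way to see how a blocking callback can inflate a measured latency is the sketch below. It does not use Scrapy or Twisted. It uses the standard library's asyncio as a stand-in event loop, and it assumes that the latency is computed as a timestamp difference taken inside an event-loop callback, so the second timestamp cannot be taken while the loop is busy. The names fake_download, blocking_callback, and measure are hypothetical and exist only for this illustration.

```python
import asyncio
import time


async def fake_download(delay):
    """Stand-in for a download: waits `delay` seconds without blocking the loop."""
    start = time.monotonic()
    await asyncio.sleep(delay)
    # This timestamp is only taken once the loop is free to resume this task.
    return time.monotonic() - start


async def blocking_callback(busy):
    """Stand-in for a spider callback that blocks, like time.sleep() in parse()."""
    time.sleep(busy)


async def measure(delay=0.01, busy=0.0):
    download = asyncio.create_task(fake_download(delay))
    await asyncio.sleep(0)  # yield once so the download starts its timer
    if busy:
        await blocking_callback(busy)  # the loop cannot run anything else meanwhile
    return await download


idle_latency = asyncio.run(measure(busy=0.0))
blocked_latency = asyncio.run(measure(busy=0.2))
print('idle loop:    {:.3f} s'.format(idle_latency))
print('blocked loop: {:.3f} s'.format(blocked_latency))
```

In both runs the simulated download itself takes about 0.01 s. In the second run the measured value also includes the time the task spent waiting for the blocked loop, which is the same shape as the growing numbers in the log above.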
Issue Analytics
- Created 5 years ago
- Reactions: 5
- Comments: 16 (10 by maintainers)
Top GitHub Comments
I think an explanation of the problem and suggestions in the docs will already help a lot. But we also don’t have a good way for the user to diagnose the problem: ideally we should issue a warning when the I/O loop is blocked for a long time.
Yes, I think this fits the topic of the performance GSoC well, even if the solution will only involve better monitoring and documentation.
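The warning suggested in the comments above could look roughly like the following sketch. It is not Scrapy code and not a proposed patch. It uses asyncio from the standard library rather than the Twisted reactor Scrapy runs on, and the names loop_watchdog and loop_watchdog logger, as well as the interval and threshold values, are assumptions made for illustration. The idea is a periodic task that checks how late it was woken up: a large delay means something held the loop.

```python
import asyncio
import logging
import time

logger = logging.getLogger('loop_watchdog')  # hypothetical logger name


async def loop_watchdog(interval=0.05, threshold=0.1, lags=None):
    """Wake up every `interval` seconds and warn when the wake-up is late."""
    while True:
        expected = time.monotonic() + interval
        await asyncio.sleep(interval)
        lag = time.monotonic() - expected
        if lag > threshold:
            logger.warning('event loop was blocked for about %.2f s', lag)
            if lags is not None:
                lags.append(lag)  # keep the value so callers can inspect it


async def main():
    lags = []
    watchdog = asyncio.create_task(loop_watchdog(lags=lags))
    await asyncio.sleep(0.1)  # the loop is responsive here
    time.sleep(0.3)           # simulate a callback that blocks the loop
    await asyncio.sleep(0.1)  # give the watchdog a chance to notice
    watchdog.cancel()
    return lags


observed_lags = asyncio.run(main())
print('blocked intervals seen:', len(observed_lags))
```

A Twisted-based version would need a different scheduling primitive (for example a periodic call on the reactor), but the lateness check itself would stay the same. asyncio's own debug mode offers a related built-in check through loop.slow_callback_duration.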