Extreme performance waste when CONCURRENT_ITEMS is large
Description
There is extreme performance waste when CONCURRENT_ITEMS is set to a large number, such as 99999.
Some days ago, I wrote a spider with CONCURRENT_ITEMS=99999 and ran it. I found that my spider used 100% CPU and crawled very slowly (40 pages/min), when normally it would reach 400+ pages/min.
Steps to Reproduce
Here is code you can run directly. You will see CPU usage rise to 100%. After a minute, you will see [scrapy.extensions.logstats] INFO: Crawled 27 pages (at 27 pages/min).
import scrapy

class TestSpider(scrapy.Spider):
    name = "test"
    custom_settings = {
        "LOG_LEVEL": "INFO",
        "CONCURRENT_ITEMS": 99999,
    }

    def start_requests(self):
        for _ in range(3000):
            yield scrapy.Request(
                url="http://httpbin.org/get",
                dont_filter=True,
            )

    def parse(self, response):
        pass
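As a hedged note (the issue text itself does not state this fix), the slowdown goes away if CONCURRENT_ITEMS is simply left at or near Scrapy's default value of 100:

```python
# Hedged workaround, not taken verbatim from the issue: keep
# CONCURRENT_ITEMS at (or near) Scrapy's documented default of 100
# instead of 99999, which only multiplies scheduling overhead.
custom_settings = {
    "LOG_LEVEL": "INFO",
    "CONCURRENT_ITEMS": 100,  # Scrapy's documented default
}
```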
Issue Analytics
- Created 2 years ago
- Comments: 9 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I suspect that might be to be expected.

I believe the root code for this is https://github.com/scrapy/scrapy/blob/7e23677b52b659b11471a63f3be9905a0bbaf995/scrapy/utils/defer.py#L72 (count comes from CONCURRENT_ITEMS). The blog post linked from there as the source of this implementation explains the trade-off in its first paragraph. So maybe the best way forward here is to treat this as a documentation issue, and document how increasing this number may be counter-productive.
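To make the overhead concrete, here is a minimal, dependency-free sketch of the pattern used in that code. The real implementation starts `count` cooperative tasks (via Twisted's Cooperator) that all pull from one shared generator; the names and scheduling loop below are illustrative only, not Scrapy's API. The point is that the scheduler still has to visit every one of the `count` tasks at least once, even when only a handful of items exist:

```python
def worker(shared, func, out):
    # One cooperative task: pull from the shared iterator, do one unit
    # of work, then yield control back to the scheduler.
    for elem in shared:
        out.append(func(elem))
        yield

def parallel_sketch(items, count, func):
    """Illustrative model: start `count` workers over one shared iterator
    and round-robin among them, counting scheduler steps."""
    shared = iter(items)
    out = []
    tasks = [worker(shared, func, out) for _ in range(count)]
    steps = 0
    while tasks:
        alive = []
        for t in tasks:          # one scheduler pass over every task
            steps += 1
            try:
                next(t)
                alive.append(t)  # task did one unit of work
            except StopIteration:
                pass             # shared iterator exhausted; task dies
        tasks = alive
    return out, steps

# 3 items, count=10000: the work is constant, but the scheduler still
# performs on the order of `count` bookkeeping steps.
out, steps = parallel_sketch(range(3), 10000, lambda x: x * 2)
print(len(out), steps)
```

With count=10000 and only 3 items, the sketch performs over 10000 scheduler steps for 3 units of real work, which mirrors why a huge CONCURRENT_ITEMS burns CPU without speeding anything up.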
I also wonder if it would make sense to count the number of elements in iterable (assuming it is not infinite) and limit count to the lesser value, i.e. count = min(count, iterable_length). Maybe we could even implement this while supporting an arbitrarily large iterable, by peeking items from iterable up to count.

@edorado93's proposal sounds very good, better than my "sane defaults" approach. Withdrawing my PR, looking forward to the implementation!
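The peeking idea from the comment above can be sketched in a few lines. This is an illustrative helper, not code from Scrapy: it reads at most `count` items ahead, caps the worker count at the number of items actually available, and reattaches the peeked items, so it remains safe for arbitrarily large or even infinite iterables:

```python
from itertools import chain, islice

def effective_count(iterable, count):
    """Cap `count` at the number of items actually available, without
    consuming the iterable. islice never reads more than `count`
    elements, so infinite iterables are safe."""
    it = iter(iterable)
    head = list(islice(it, count))      # peek at most `count` items
    return min(count, len(head)), chain(head, it)  # reattach the peeked items
```

For example, effective_count(range(3), 99999) would cap the worker count at 3 while still yielding all three items, whereas an infinite iterable keeps the full requested count.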