
Extreme performance waste when CONCURRENT_ITEMS is large

See original GitHub issue

Description

There is extreme performance waste when CONCURRENT_ITEMS is set to a large number, such as 9999.

A few days ago, I wrote a spider with CONCURRENT_ITEMS=9999 and ran it. The spider used 100% CPU and the crawl speed was very slow (40 pages/min), whereas it would normally be 400+ pages/min.

Steps to Reproduce

Here is code you can run directly. CPU usage will climb to 100%, and after a minute you will see [scrapy.extensions.logstats] INFO: Crawled 27 pages (at 27 pages/min).


import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'

    custom_settings = {
        "LOG_LEVEL": "INFO",
        # Deliberately huge to reproduce the slowdown (the default is 100).
        "CONCURRENT_ITEMS": 99999,
    }

    def start_requests(self):
        for _ in range(3000):
            yield scrapy.Request(
                url="http://httpbin.org/get",
                dont_filter=True,
            )

    def parse(self, response):
        # No items are yielded, yet the slowdown is still observed.
        pass
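
To reproduce, save the spider above as test_spider.py (the filename here is arbitrary) and run it directly with runspider:

scrapy runspider test_spider.py

Then watch the [scrapy.extensions.logstats] lines and CPU usage; the crawl rate should stay far below the usual 400+ pages/min.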

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
Gallaecio commented, Jul 13, 2021

I suspect that might be expected behavior.

I believe the root cause is this code: https://github.com/scrapy/scrapy/blob/7e23677b52b659b11471a63f3be9905a0bbaf995/scrapy/utils/defer.py#L72 (count comes from CONCURRENT_ITEMS). Quoting the first paragraph of the blog post linked from there as the source of this implementation:

Concurrency can be a great way to speed things up, but what happens when you have too much concurrency? Overloading a system or a network can be detrimental to performance.

So maybe the best way forward here is to treat this as a documentation issue, and document how increasing this number may be counter-productive.

I also wonder if it would make sense to count the number of elements in iterable (assuming it is not infinite) and limit count to the lesser value, i.e. count = min(count, iterable_length). Maybe we could even implement this while supporting an arbitrarily large iterable, by peeking at up to count items from iterable. (A rough sketch of the code in question follows below.)
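
For context, here is a simplified sketch, paraphrased from the linked defer.py revision rather than verbatim Scrapy code, of how parallel() fans out the work; the count argument is where CONCURRENT_ITEMS ends up:

from twisted.internet import defer, task


def parallel_sketch(iterable, count, callable, *args, **named):
    """Run callable over iterable with up to ``count`` cooperative tasks."""
    coop = task.Cooperator()
    work = (callable(elem, *args, **named) for elem in iterable)
    # One coiterate() task per unit of concurrency. With CONCURRENT_ITEMS=99999
    # this builds a DeferredList of ~100k deferreds for every response, even
    # though ``work`` is usually exhausted after a handful of items; that
    # bookkeeping is a plausible source of the CPU burn reported above.
    return defer.DeferredList([coop.coiterate(work) for _ in range(count)])

Capping count at the actual number of queued items, e.g. count = min(count, iterable_length) where iterable_length is a hypothetical measured size, would avoid that overhead for finite iterables; peeking at up to count items (itertools.islice-style) could extend the same cap to unbounded ones.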

1 reaction
vgerak commented, Oct 16, 2021

@edorado93’s proposal sounds very good, better than my “sane defaults” approach. Withdrawing my PR; looking forward to the implementation!

Read more comments on GitHub >
