Round-robin domain crawling: a second scheduler to improve performance
Propose merging a `DomainScheduler`, implemented in https://github.com/tianhuil/domain_scheduler, into Scrapy. It crawls in a domain-smart way, round-robin cycling through the domains. This has two benefits:
- Spreading out load on the target servers instead of hitting one server with many requests at once (illustrated in the sketch below).
- Reducing delays caused by server-side throttling or Scrapy's own `CONCURRENT_REQUESTS_PER_IP` restrictions. Empirical testing has shown this to be quite effective.
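To make the round-robin idea concrete, here is a minimal, self-contained sketch of the ordering it produces. This illustrates the technique only; it is not the actual `DomainScheduler` code, and the `round_robin` helper and example URLs are hypothetical.

```python
from collections import OrderedDict, deque
from urllib.parse import urlparse

def round_robin(urls):
    """Group URLs by domain, then yield them by cycling through the domains."""
    by_domain = OrderedDict()
    for url in urls:
        by_domain.setdefault(urlparse(url).netloc, deque()).append(url)
    while by_domain:
        for domain in list(by_domain):
            yield by_domain[domain].popleft()
            if not by_domain[domain]:
                del by_domain[domain]  # exhausted domains leave the cycle

urls = [
    "http://a.example/1", "http://a.example/2", "http://a.example/3",
    "http://b.example/1", "http://c.example/1",
]
print(list(round_robin(urls)))
# FIFO order would hit a.example three times in a row; round-robin yields
# a.example/1, b.example/1, c.example/1, a.example/2, a.example/3
```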
It implements the solution proposed in https://github.com/scrapy/scrapy/issues/1802#issue-135260562, which reported similar performance improvements. The idea was first raised in https://github.com/scrapy/scrapy/issues/1802 and https://github.com/scrapy/scrapy/issues/2474.
Note: this requires more than just setting `SCHEDULER_PRIORITY_QUEUE`, because it needs an API change to the queue: passing a non-integer key (i.e. the domain) to a new round-robin queue. It therefore depends on first merging scrapy/queuelib#21.
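To show what that API change amounts to, here is a hedged sketch of a keyed queue whose `push` takes an arbitrary hashable key rather than queuelib's integer priority. The `RoundRobinQueue` name and signature are assumptions drawn from the proposal, not an existing queuelib class; see scrapy/queuelib#21 for the real patch.

```python
from collections import deque

class RoundRobinQueue:
    """Hypothetical keyed queue: push(obj, key) buckets objects by key
    (here, the request's domain) and pop() cycles through the keys.
    Loosely modeled on queuelib's PriorityQueue interface, which today
    only accepts integer priorities (the change this proposal needs)."""

    def __init__(self):
        self._queues = {}      # key -> deque of queued objects
        self._cycle = deque()  # keys in round-robin order

    def push(self, obj, key):
        if key not in self._queues:
            self._queues[key] = deque()
            self._cycle.append(key)  # new keys join the back of the cycle
        self._queues[key].append(obj)

    def pop(self):
        if not self._cycle:
            return None
        key = self._cycle.popleft()
        obj = self._queues[key].popleft()
        if self._queues[key]:
            self._cycle.append(key)  # rotate the key to the back
        else:
            del self._queues[key]    # exhausted keys leave the rotation
        return obj

    def __len__(self):
        return sum(len(q) for q in self._queues.values())
```

Keys leave the rotation as soon as their bucket empties, so idle domains cost nothing, and a newly seen domain simply joins the back of the cycle.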
If there's interest, I'm happy to set up a PR that refactors `Scheduler` so that both `DomainScheduler` and `Scheduler` can exist in Scrapy, as there is significant code overlap.
Top GitHub Comments
@cathalgarvey: thanks for your thoughts. My proposal was to keep both the original `Scheduler` (perhaps renamed `DefaultScheduler`) and a `DomainScheduler`, and have them configurable in `settings.py`'s `SCHEDULER` variable (sketched below). Obviously, we would keep `DefaultScheduler` as the default choice.

@tianhuil how can we use it with redis_scrapy?
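For reference, a minimal `settings.py` sketch of the scheduler switch described above. The default path shown is Scrapy's actual default; the `DomainScheduler` path is hypothetical, since the class was never merged.

```python
# settings.py
# Scrapy's actual default scheduler:
SCHEDULER = "scrapy.core.scheduler.Scheduler"

# Under this proposal, round-robin domain scheduling would be opt-in
# (import path hypothetical; DomainScheduler was never merged):
# SCHEDULER = "scrapy.core.scheduler.DomainScheduler"
```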