Round-robin domain crawling: a second scheduler to improve performance
Propose merging a `DomainScheduler`, implemented in https://github.com/tianhuil/domain_scheduler, into Scrapy. It crawls in a domain-smart way, round-robin cycling through the domains. This has two benefits:
- Spreading out load on the target servers instead of hitting one server with many requests at once (illustrated in the sketch below).
- Reducing delays caused by server-side throttling or Scrapy's own `CONCURRENT_REQUESTS_PER_IP` restrictions. Empirical testing has shown this to be quite effective.
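To make the round-robin idea concrete, here is a minimal, self-contained sketch of the ordering it produces. This illustrates the technique only; it is not the actual `DomainScheduler` code, and the `round_robin` helper and example URLs are hypothetical.

```python
from collections import OrderedDict, deque
from urllib.parse import urlparse

def round_robin(urls):
    """Group URLs by domain, then yield them by cycling through the domains."""
    by_domain = OrderedDict()
    for url in urls:
        by_domain.setdefault(urlparse(url).netloc, deque()).append(url)
    while by_domain:
        for domain in list(by_domain):
            yield by_domain[domain].popleft()
            if not by_domain[domain]:
                del by_domain[domain]  # exhausted domains leave the cycle

urls = [
    "http://a.example/1", "http://a.example/2", "http://a.example/3",
    "http://b.example/1", "http://c.example/1",
]
print(list(round_robin(urls)))
# FIFO order would hit a.example three times in a row; round-robin yields
# a.example/1, b.example/1, c.example/1, a.example/2, a.example/3
```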
It implements the solution proposed in https://github.com/scrapy/scrapy/issues/1802#issue-135260562, which reported similar performance improvements. The idea was first raised in https://github.com/scrapy/scrapy/issues/1802 and https://github.com/scrapy/scrapy/issues/2474.
Note: this requires more than just setting `SCHEDULER_PRIORITY_QUEUE`, because it needs an API change to the queue: passing a non-integer key (i.e. the domain) to a new round-robin queue. It therefore depends on first merging scrapy/queuelib#21.
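To show what that API change amounts to, here is a hedged sketch of a keyed queue whose `push` takes an arbitrary hashable key rather than queuelib's integer priority. The `RoundRobinQueue` name and signature are assumptions drawn from the proposal, not an existing queuelib class; see scrapy/queuelib#21 for the real patch.

```python
from collections import deque

class RoundRobinQueue:
    """Hypothetical keyed queue: push(obj, key) buckets objects by key
    (here, the request's domain) and pop() cycles through the keys.
    Loosely modeled on queuelib's PriorityQueue interface, which today
    only accepts integer priorities (the change this proposal needs)."""

    def __init__(self):
        self._queues = {}      # key -> deque of queued objects
        self._cycle = deque()  # keys in round-robin order

    def push(self, obj, key):
        if key not in self._queues:
            self._queues[key] = deque()
            self._cycle.append(key)  # new keys join the back of the cycle
        self._queues[key].append(obj)

    def pop(self):
        if not self._cycle:
            return None
        key = self._cycle.popleft()
        obj = self._queues[key].popleft()
        if self._queues[key]:
            self._cycle.append(key)  # rotate the key to the back
        else:
            del self._queues[key]    # exhausted keys leave the rotation
        return obj

    def __len__(self):
        return sum(len(q) for q in self._queues.values())
```

Keys leave the rotation as soon as their bucket empties, so idle domains cost nothing, and a newly seen domain simply joins the back of the cycle.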
If there's interest, I'm happy to set up a PR that refactors `Scheduler` so that both `DomainScheduler` and `Scheduler` can exist in Scrapy, as there is significant code overlap.
Top GitHub Comments
@cathalgarvey: thanks for your thoughts. My proposal was to keep both the original `Scheduler` (perhaps renamed `DefaultScheduler`) and a `DomainScheduler`, and have them configurable in `settings.py`'s `SCHEDULER` variable (sketched below). Obviously, we would keep `DefaultScheduler` as the default choice.

@tianhuil how can we use it with redis_scrapy?
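For reference, a minimal `settings.py` sketch of the scheduler switch described above. The default path shown is Scrapy's actual default; the `DomainScheduler` path is hypothetical, since the class was never merged.

```python
# settings.py
# Scrapy's actual default scheduler:
SCHEDULER = "scrapy.core.scheduler.Scheduler"

# Under this proposal, round-robin domain scheduling would be opt-in
# (import path hypothetical; DomainScheduler was never merged):
# SCHEDULER = "scrapy.core.scheduler.DomainScheduler"
```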