
Round Robin Domain Crawling second scheduler to improve performance

See original GitHub issue

I propose merging the DomainScheduler implemented in https://github.com/tianhuil/domain_scheduler into scrapy. It scrapes in a domain-smart way, round-robin cycling through the domains. This has two benefits:

  1. Spreading out load on the target servers instead of hitting the server with many requests at once
  2. Reducing delays caused by server-throttling or scrapy’s own CONCURRENT_REQUESTS_PER_IP restrictions. Empirical testing has shown this to be quite effective.
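The round-robin idea can be sketched as follows (a minimal illustration, not the actual DomainScheduler code): keep one FIFO queue per domain, and on each pop rotate to the next domain that still has pending requests.

```python
from collections import deque
from urllib.parse import urlparse

class RoundRobinDomainQueue:
    """Illustrative sketch: one FIFO queue per domain, popped round-robin
    so consecutive requests tend to hit different domains."""

    def __init__(self):
        self.queues = {}        # domain -> deque of pending URLs
        self.domains = deque()  # rotation order of known domains

    def push(self, url):
        domain = urlparse(url).netloc
        if domain not in self.queues:
            self.queues[domain] = deque()
            self.domains.append(domain)
        self.queues[domain].append(url)

    def pop(self):
        # Visit each domain at most once per call; skip empty queues.
        for _ in range(len(self.domains)):
            domain = self.domains[0]
            self.domains.rotate(-1)  # move this domain to the back
            if self.queues[domain]:
                return self.queues[domain].popleft()
        return None  # nothing pending anywhere
```

With two domains queued up, pops alternate between them instead of draining one domain first, which is exactly how the load-spreading benefit above arises.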

It implements the solution proposed in https://github.com/scrapy/scrapy/issues/1802#issue-135260562, which found similar performance improvements. The original proposal was first posted in https://github.com/scrapy/scrapy/issues/1802 and https://github.com/scrapy/scrapy/issues/2474.

Note: It requires more than just using SCHEDULER_PRIORITY_QUEUE as it needs an API change to the queue (passing in a non-integer key, i.e. the domain, to a new round robin queue). Therefore, it is dependent on first merging scrapy/queuelib#21.
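To illustrate the API change (hypothetical key functions, not queuelib's actual interface): existing priority queues key each entry by an integer priority, whereas the proposed round-robin queue would key entries by an arbitrary value such as the request's domain.

```python
from urllib.parse import urlparse

def priority_key(request_url: str) -> int:
    # Integer key, the kind SCHEDULER_PRIORITY_QUEUE works with today.
    return 0

def domain_key(request_url: str) -> str:
    # Non-integer key -- the request's domain -- for a round-robin queue.
    return urlparse(request_url).netloc
```

The queuelib change referenced above is what would let a queue accept the second kind of key.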

If there’s interest, I’m happy to set up a PR that refactors Scheduler so that both DomainScheduler and Scheduler can exist in scrapy, as there is significant code overlap between them.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
tianhuil commented, Feb 26, 2018

@cathalgarvey: thanks for your thoughts. My proposal was to keep both the original Scheduler (perhaps renamed as DefaultScheduler) and a DomainScheduler and have them configurable in settings.py’s SCHEDULER variable. Obviously, we would keep DefaultScheduler the default choice.
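Under that proposal, switching schedulers would be a one-line settings change. The snippet below is illustrative only: the DomainScheduler class path assumes the proposal were merged, and is not part of released Scrapy; SCHEDULER itself is a real Scrapy setting that defaults to the built-in scheduler.

```python
# settings.py -- illustrative configuration sketch

# Default behavior (the existing scheduler, here imagined as DefaultScheduler):
# SCHEDULER = "scrapy.core.scheduler.Scheduler"

# Opt in to round-robin domain scheduling (hypothetical class path):
SCHEDULER = "scrapy.core.scheduler.DomainScheduler"
```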

0 reactions
Rashid1152 commented, May 6, 2021

@tianhuil how can we use it with redis_scrapy?


Top Results From Across the Web

A Task Scheduling Strategy Based on Weighted Round-Robin ...
The form of crawlers will gradually tend to distributed. This paper proposes a task scheduling strategy based on weighted Round-Robin for small- ...

Scrapy Update: Better Broad Crawl Performance | Zyte
Round-robin algorithm can be used for request scheduling: store all entities in FIFO queue Q; when the next request should be scheduled...

Design and Implementation of a High ... - CiteSeerX
“freshness” of a collection of pages [11, 10], or scheduling of crawling activity over time [25]. In contrast, there has been less work...

About Crawling Scheduling Problems - CEUR-WS
task for the purpose of enhancing efficiency (shortening the scanning time in this case). ... Keywords: Enterprise Desktop Grid, Web Crawling, Round-Robin.
