Use priority queues for Downloader slot queues
Currently downloader slots use collections.deque for their request queue. This means that once a request has passed from the scheduler to the downloader, its priority is no longer respected.
Let’s say the global concurrency limit is 10, the scheduler returned 10 low-priority requests (all for a single downloader slot), then the user scheduled a high-priority request (for the same slot), then one of the 10 low-priority requests was processed, and the downloader fetched the high-priority request from the scheduler. In this case the new high-priority request will be handled only after the 9 remaining low-priority requests.
What about using a priority queue from queuelib instead of deque?
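The scenario above can be sketched in a few lines. This is only an illustration, not Scrapy's downloader code: it uses the standard-library heapq as a stand-in for a queuelib priority queue, and the function names and request labels are hypothetical. As with Scrapy request priorities, a higher number means a higher priority.

```python
import heapq
from collections import deque


def fifo_slot_order(requests):
    """Hand out (name, priority) requests in arrival order, as a deque does."""
    slot = deque(requests)
    return [slot.popleft()[0] for _ in range(len(slot))]


def priority_slot_order(requests):
    """Hand out requests highest priority first, FIFO among equal priorities."""
    # Negate the priority because heapq pops the smallest item first;
    # the sequence number keeps arrival order for equal priorities.
    heap = [(-priority, seq, name) for seq, (name, priority) in enumerate(requests)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Nine low-priority requests are still queued in the slot when the
# high-priority request arrives (hypothetical labels).
pending = [(f"low{i}", 0) for i in range(2, 11)] + [("high", 10)]

fifo_wait = fifo_slot_order(pending).index("high")
priority_wait = priority_slot_order(pending).index("high")
print(fifo_wait, priority_wait)  # 9 0
```

With the deque, the high-priority request waits behind all nine queued requests. With the heap, it is handed out first.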
Issue Analytics
- State:
- Created 8 years ago
- Comments:10 (10 by maintainers)
Let’s test this script with various settings.
script
1. Default concurrency settings (CONCURRENT_REQUESTS=16, CONCURRENT_REQUESTS_PER_DOMAIN=8)
log output (default settings except "DOWNLOAD_DELAY":1)
2. Custom settings
{"DOWNLOAD_DELAY":1, "CONCURRENT_REQUESTS":1, "CONCURRENT_REQUESTS_PER_DOMAIN":1 }
With this configuration, request priority is taken into account on both the scheduler side and the downloader side (as requested here). On the scheduler side, because it already has a priority queue. On the downloader side, because the custom settings reduce its queue to a size of 1, so the downloader queue always contains the highest-priority request.
log output
This looks better, but it is still not the expected strict order of requests (low priority 1, high priority 1, low priority 2, high priority 2, etc.).
When the downloader received the first response (...page1.html), the application asked the scheduler for the next request to send to the downloader. As the first response (...page1.html) had not been parsed at that moment (and so had not yet produced a new high-priority request), it took the next request from the scheduler queue (low priority ...page2.html) and sent it to the server. Technically, the application still respects request priorities.
The key point is that a low-priority request moved from the scheduler queue to the downloader queue without waiting for the result of parse on the previously received low-priority response (which produces the high-priority request we expect to be sent next). In this case (and equally with a priority queue implemented for the downloader) we will not get a completely fixed/strict order of requests.
This happens because the default settings allow it: https://github.com/scrapy/scrapy/blob/afa5881ada816a2fc5555f6272dbfe87f7973222/scrapy/settings/default_settings.py#L263 This setting means that a request may be sent from the scheduler queue to the downloader queue as long as the total size of unparsed responses is less than SCRAPER_SLOT_MAX_ACTIVE_SIZE (~5 MB), so this is the direct reason the order of requests is not strict.
3. Custom settings (reduced scraper slot max active size)
{"DOWNLOAD_DELAY":1, "CONCURRENT_REQUESTS":1, "CONCURRENT_REQUESTS_PER_DOMAIN":1, "SCRAPER_SLOT_MAX_ACTIVE_SIZE":0 }
log output
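The difference between the ~5 MB default and a value of 0 can be illustrated with a toy model. This is a sketch under stated assumptions, not Scrapy's engine: it assumes one request in flight at a time, a fixed hypothetical response size, that every low-priority page yields one high-priority request when parsed, and that the engine holds back new requests whenever the size of unparsed responses exceeds the limit. The exact comparison and all names here are assumptions made for illustration.

```python
import heapq
import itertools
from collections import deque


def crawl_order(max_active_size, pages=3, response_size=1024):
    """Return the order in which requests are sent in the toy model."""
    counter = itertools.count()
    scheduler = []  # heap of (-priority, sequence, name)

    def schedule(name, priority):
        heapq.heappush(scheduler, (-priority, next(counter), name))

    for n in range(1, pages + 1):
        schedule(f"low{n}", 0)

    sent = []           # requests in the order they reach the downloader
    unparsed = deque()  # downloaded responses waiting for the parse callback
    in_flight = None    # the single request currently being downloaded

    while scheduler or unparsed or in_flight is not None:
        # Assumed gate: hold back while unparsed responses exceed the limit.
        backing_out = len(unparsed) * response_size > max_active_size
        if in_flight is None and scheduler and not backing_out:
            in_flight = heapq.heappop(scheduler)[2]
            sent.append(in_flight)
        elif unparsed:
            name = unparsed.popleft()
            if name.startswith("low"):
                # Parsing a low-priority page yields a high-priority request.
                schedule("high" + name[3:], 10)
        else:
            # The download finishes; its response now waits to be parsed.
            unparsed.append(in_flight)
            in_flight = None
    return sent


default_order = crawl_order(max_active_size=5_000_000)
strict_order = crawl_order(max_active_size=0)
print(default_order)
print(strict_order)
```

In this model the large limit lets low2 leave the scheduler before low1 has been parsed, while a limit of 0 forces each response to be parsed before the next request is taken, which gives the strict low/high alternation.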
With this configuration, setting SCRAPER_SLOT_MAX_ACTIVE_SIZE to 0, in addition to the other reduced concurrency settings, guarantees that the next request from the scheduler is moved to the downloader only after all received responses have been processed.
This configuration may run slower than the default ~5 MB value (especially with low or zero values of the DOWNLOAD_DELAY setting), but it allows more precise control over the order in which requests are sent and processed.
I am starting to think maybe we should not make any change code-wise here, and instead make sure the documentation explains clearly what @GeorgeA92 covered above.
On a related note: at the moment, the scheduler handles request feed order, and the downloader handles slots. But slots should be taken into account for proper request ordering, and so we end up with something like DownloaderAwarePriorityQueue for the scheduler. I wonder if we should move slot handling to the scheduler instead.