Use priority queues for Downloader slot queues

Currently, downloader slots use collections.deque for their request queues. This means that once a request has moved from the scheduler to the downloader, its priority is no longer respected.

Let’s say the global concurrency limit is 10. The scheduler returns 10 low-priority requests (all for a single downloader slot), and then the user schedules a high-priority request (for the same slot). When one of the 10 low-priority requests has been processed, the downloader fetches the high-priority request from the scheduler. In this case, the new high-priority request will only be handled after the 9 remaining low-priority requests.

What about using a priority queue from queuelib instead of deque?
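To make the scenario concrete, here is a minimal sketch that compares a FIFO deque with a priority queue. It uses the standard library's heapq as a stand-in rather than queuelib, and the request names and the two helper functions are hypothetical; it is not Scrapy's downloader code. As in Scrapy, a larger priority value means the request should go first.

```python
import heapq
from collections import deque
from itertools import count


def drain_fifo(requests):
    """Return request names in the order a plain deque would hand them out."""
    queue = deque(requests)
    order = []
    while queue:
        name, _priority = queue.popleft()  # priority is ignored
        order.append(name)
    return order


def drain_by_priority(requests):
    """Return request names highest priority first, FIFO among equal priorities."""
    tie_breaker = count()  # keeps insertion order for equal priorities
    heap = []
    for name, priority in requests:
        # heapq is a min-heap, so negate the priority to pop the largest first
        heapq.heappush(heap, (-priority, next(tie_breaker), name))
    order = []
    while heap:
        _neg_priority, _seq, name = heapq.heappop(heap)
        order.append(name)
    return order


# 9 low-priority requests already in the slot, then one high-priority request
requests = [(f"low-{i}", 0) for i in range(1, 10)] + [("high", 10)]
fifo_order = drain_fifo(requests)
priority_order = drain_by_priority(requests)

print(fifo_order.index("high"))      # 9: handled after all low-priority requests
print(priority_order.index("high"))  # 0: handled first
```

With the deque, the high-priority request waits behind everything already queued; with the priority queue it jumps to the front while the low-priority requests keep their relative order.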

//cc @dangra @shirk3y

Issue Analytics

  • State: open
  • Created 8 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

GeorgeA92 commented, May 30, 2022

Let's test this script with various settings.

script
import scrapy
from scrapy.crawler import CrawlerProcess


class BooksToScrapeSpider(scrapy.Spider):
    name = "books"
    # Listing pages 1..31, scheduled with the default priority
    start_urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 32)]
    custom_settings = {"DOWNLOAD_DELAY": 1}

    def parse(self, response):
        # Follow only the first book on each listing page, with a higher priority
        yield scrapy.Request(
            response.urljoin(response.css('ol.row .product_pod a::attr(href)').get()),
            callback=self.parse_book,
            priority=10,
        )

    def parse_book(self, response):
        pass


process = CrawlerProcess()
process.crawl(BooksToScrapeSpider)
process.start()

1. Default concurrency settings (CONCURRENT_REQUESTS=16, CONCURRENT_REQUESTS_PER_DOMAIN=8)

log output (default settings except "DOWNLOAD_DELAY":1)
2022-05-30 16:42:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-1.html> (referer: None)
2022-05-30 16:42:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-2.html> (referer: None)
2022-05-30 16:42:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-3.html> (referer: None)
2022-05-30 16:42:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-4.html> (referer: None)
2022-05-30 16:42:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-5.html> (referer: None)
2022-05-30 16:42:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-6.html> (referer: None)
2022-05-30 16:42:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-7.html> (referer: None)
2022-05-30 16:42:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-8.html> (referer: None)
2022-05-30 16:42:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-9.html> (referer: None)
2022-05-30 16:42:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-10.html> (referer: None)
2022-05-30 16:42:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-11.html> (referer: None)
2022-05-30 16:42:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-12.html> (referer: None)
2022-05-30 16:42:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-13.html> (referer: None)
2022-05-30 16:42:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-14.html> (referer: None)
2022-05-30 16:42:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-15.html> (referer: None)
2022-05-30 16:42:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-16.html> (referer: None)
2022-05-30 16:42:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-17.html> (referer: None)
2022-05-30 16:42:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html> (referer: https://books.toscrape.com/catalogue/page-1.html)
2022-05-30 16:42:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/in-her-wake_980/index.html> (referer: https://books.toscrape.com/catalogue/page-2.html)
2022-05-30 16:42:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/slow-states-of-collapse-poems_960/index.html> (referer: https://books.toscrape.com/catalogue/page-3.html)
2022-05-30 16:42:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-nameless-city-the-nameless-city-1_940/index.html> (referer: https://books.toscrape.com/catalogue/page-4.html)
2022-05-30 16:42:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/princess-jellyfish-2-in-1-omnibus-vol-01-princess-jellyfish-2-in-1-omnibus-1_920/index.html> (referer: https://books.toscrape.com/catalogue/page-5.html)
2022-05-30 16:42:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/immunity-how-elie-metchnikoff-changed-the-course-of-modern-medicine_900/index.html> (referer: https://books.toscrape.com/catalogue/page-6.html)
2022-05-30 16:42:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/algorithms-to-live-by-the-computer-science-of-human-decisions_880/index.html> (referer: https://books.toscrape.com/catalogue/page-7.html)
2022-05-30 16:42:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-shadow-hero-the-shadow-hero_860/index.html> (referer: https://books.toscrape.com/catalogue/page-8.html)
2022-05-30 16:42:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-bridge-to-consciousness-im-writing-the-bridge-between-science-and-our-old-and-new-beliefs_840/index.html> (referer: https://books.toscrape.com/catalogue/page-9.html)
2022-05-30 16:42:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/modern-romance_820/index.html> (referer: https://books.toscrape.com/catalogue/page-10.html)
2022-05-30 16:42:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/dark-notes_800/index.html> (referer: https://books.toscrape.com/catalogue/page-11.html)
2022-05-30 16:42:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/whole-lotta-creativity-going-on-60-fun-and-unusual-exercises-to-awaken-and-strengthen-your-creativity_780/index.html> (referer: https://books.toscrape.com/catalogue/page-12.html)
2022-05-30 16:42:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-power-of-habit-why-we-do-what-we-do-in-life-and-business_760/index.html> (referer: https://books.toscrape.com/catalogue/page-13.html)
2022-05-30 16:42:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/swell-a-year-of-waves_740/index.html> (referer: https://books.toscrape.com/catalogue/page-14.html)
2022-05-30 16:42:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/my-name-is-lucy-barton_720/index.html> (referer: https://books.toscrape.com/catalogue/page-15.html)
2022-05-30 16:42:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/hold-your-breath-search-and-rescue-1_700/index.html> (referer: https://books.toscrape.com/catalogue/page-16.html)
2022-05-30 16:42:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/code-name-verity-code-name-verity-1_680/index.html> (referer: https://books.toscrape.com/catalogue/page-17.html)
2022-05-30 16:42:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-18.html> (referer: None)
2022-05-30 16:42:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-19.html> (referer: None)
2022-05-30 16:42:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-20.html> (referer: None)
2022-05-30 16:42:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-21.html> (referer: None)
2022-05-30 16:43:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-22.html> (referer: None)
2022-05-30 16:43:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-23.html> (referer: None)
2022-05-30 16:43:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-24.html> (referer: None)
2022-05-30 16:43:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-25.html> (referer: None)
2022-05-30 16:43:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-26.html> (referer: None)
2022-05-30 16:43:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-27.html> (referer: None)
2022-05-30 16:43:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-28.html> (referer: None)
2022-05-30 16:43:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-29.html> (referer: None)
2022-05-30 16:43:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-30.html> (referer: None)
2022-05-30 16:43:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-31.html> (referer: None)
2022-05-30 16:43:11 [scrapy.extensions.logstats] INFO: Crawled 48 pages (at 48 pages/min), scraped 0 items (at 0 items/min)
2022-05-30 16:43:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/all-the-light-we-cannot-see_660/index.html> (referer: https://books.toscrape.com/catalogue/page-18.html)
2022-05-30 16:43:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-midnight-watch-a-novel-of-the-titanic-and-the-californian_640/index.html> (referer: https://books.toscrape.com/catalogue/page-19.html)
2022-05-30 16:43:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/hide-away-eve-duncan-20_620/index.html> (referer: https://books.toscrape.com/catalogue/page-20.html)
2022-05-30 16:43:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/mothering-sunday_600/index.html> (referer: https://books.toscrape.com/catalogue/page-21.html)
2022-05-30 16:43:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/without-shame_580/index.html> (referer: https://books.toscrape.com/catalogue/page-22.html)
2022-05-30 16:43:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/chernobyl-012340-the-incredible-true-story-of-the-worlds-worst-nuclear-disaster_560/index.html> (referer: https://books.toscrape.com/catalogue/page-23.html)
2022-05-30 16:43:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/roller-girl_540/index.html> (referer: https://books.toscrape.com/catalogue/page-24.html)
2022-05-30 16:43:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/heaven-is-for-real-a-little-boys-astounding-story-of-his-trip-to-heaven-and-back_520/index.html> (referer: https://books.toscrape.com/catalogue/page-25.html)
2022-05-30 16:43:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-story-of-art_500/index.html> (referer: https://books.toscrape.com/catalogue/page-26.html)
2022-05-30 16:43:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/nightstruck-a-novel_480/index.html> (referer: https://books.toscrape.com/catalogue/page-27.html)
2022-05-30 16:43:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/benjamin-franklin-an-american-life_460/index.html> (referer: https://books.toscrape.com/catalogue/page-28.html)
2022-05-30 16:43:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-barefoot-contessa-cookbook_440/index.html> (referer: https://books.toscrape.com/catalogue/page-29.html)
2022-05-30 16:43:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/world-without-end-the-pillars-of-the-earth-2_420/index.html> (referer: https://books.toscrape.com/catalogue/page-30.html)
2022-05-30 16:43:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-dream-thieves-the-raven-cycle-2_400/index.html> (referer: https://books.toscrape.com/catalogue/page-31.html)
2022-05-30 16:43:26 [scrapy.core.engine] INFO: Closing spider (finished)
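Here we see exactly the same request processing order as described in the first message of this issue: high-priority book requests wait behind low-priority page requests that have already reached the downloader.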

2. Custom settings {"DOWNLOAD_DELAY":1, "CONCURRENT_REQUESTS":1, "CONCURRENT_REQUESTS_PER_DOMAIN":1}

With this configuration, request priority is respected on both the scheduler side and the downloader side (as requested here). The scheduler respects it because it already has a priority queue. The downloader respects it because the custom settings reduce its queue to a size of 1, so the downloader queue always contains the highest-priority request.

log output
2022-05-30 16:53:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-1.html> (referer: None)
2022-05-30 16:53:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-2.html> (referer: None)
2022-05-30 16:53:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html> (referer: https://books.toscrape.com/catalogue/page-1.html)
2022-05-30 16:53:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/in-her-wake_980/index.html> (referer: https://books.toscrape.com/catalogue/page-2.html)
2022-05-30 16:53:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-3.html> (referer: None)
2022-05-30 16:53:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-4.html> (referer: None)
2022-05-30 16:53:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/slow-states-of-collapse-poems_960/index.html> (referer: https://books.toscrape.com/catalogue/page-3.html)
2022-05-30 16:53:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-nameless-city-the-nameless-city-1_940/index.html> (referer: https://books.toscrape.com/catalogue/page-4.html)
2022-05-30 16:53:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-5.html> (referer: None)
2022-05-30 16:53:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-6.html> (referer: None)
2022-05-30 16:53:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/princess-jellyfish-2-in-1-omnibus-vol-01-princess-jellyfish-2-in-1-omnibus-1_920/index.html> (referer: https://books.toscrape.com/catalogue/page-5.html)
2022-05-30 16:53:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/immunity-how-elie-metchnikoff-changed-the-course-of-modern-medicine_900/index.html> (referer: https://books.toscrape.com/catalogue/page-6.html)
2022-05-30 16:53:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-7.html> (referer: None)
2022-05-30 16:53:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-8.html> (referer: None)
2022-05-30 16:53:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/algorithms-to-live-by-the-computer-science-of-human-decisions_880/index.html> (referer: https://books.toscrape.com/catalogue/page-7.html)
2022-05-30 16:53:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-shadow-hero-the-shadow-hero_860/index.html> (referer: https://books.toscrape.com/catalogue/page-8.html)
2022-05-30 16:53:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-9.html> (referer: None)
2022-05-30 16:53:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-10.html> (referer: None)
2022-05-30 16:53:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-bridge-to-consciousness-im-writing-the-bridge-between-science-and-our-old-and-new-beliefs_840/index.html> (referer: https://books.toscrape.com/catalogue/page-9.html)
2022-05-30 16:53:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/modern-romance_820/index.html> (referer: https://books.toscrape.com/catalogue/page-10.html)
2022-05-30 16:53:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-11.html> (referer: None)
2022-05-30 16:53:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-12.html> (referer: None)
2022-05-30 16:53:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/dark-notes_800/index.html> (referer: https://books.toscrape.com/catalogue/page-11.html)
2022-05-30 16:53:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/whole-lotta-creativity-going-on-60-fun-and-unusual-exercises-to-awaken-and-strengthen-your-creativity_780/index.html> (referer: https://books.toscrape.com/catalogue/page-12.html)
2022-05-30 16:53:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-13.html> (referer: None)
2022-05-30 16:53:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-14.html> (referer: None)
2022-05-30 16:53:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-power-of-habit-why-we-do-what-we-do-in-life-and-business_760/index.html> (referer: https://books.toscrape.com/catalogue/page-13.html)
2022-05-30 16:54:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/swell-a-year-of-waves_740/index.html> (referer: https://books.toscrape.com/catalogue/page-14.html)
2022-05-30 16:54:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-15.html> (referer: None)
2022-05-30 16:54:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-16.html> (referer: None)
2022-05-30 16:54:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/my-name-is-lucy-barton_720/index.html> (referer: https://books.toscrape.com/catalogue/page-15.html)
2022-05-30 16:54:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/hold-your-breath-search-and-rescue-1_700/index.html> (referer: https://books.toscrape.com/catalogue/page-16.html)
2022-05-30 16:54:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-17.html> (referer: None)
2022-05-30 16:54:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-18.html> (referer: None)
2022-05-30 16:54:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/code-name-verity-code-name-verity-1_680/index.html> (referer: https://books.toscrape.com/catalogue/page-17.html)
2022-05-30 16:54:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/all-the-light-we-cannot-see_660/index.html> (referer: https://books.toscrape.com/catalogue/page-18.html)
2022-05-30 16:54:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-19.html> (referer: None)
2022-05-30 16:54:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-20.html> (referer: None)
2022-05-30 16:54:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-midnight-watch-a-novel-of-the-titanic-and-the-californian_640/index.html> (referer: https://books.toscrape.com/catalogue/page-19.html)
2022-05-30 16:54:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/hide-away-eve-duncan-20_620/index.html> (referer: https://books.toscrape.com/catalogue/page-20.html)
2022-05-30 16:54:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-21.html> (referer: None)
2022-05-30 16:54:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-22.html> (referer: None)
2022-05-30 16:54:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/mothering-sunday_600/index.html> (referer: https://books.toscrape.com/catalogue/page-21.html)
2022-05-30 16:54:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/without-shame_580/index.html> (referer: https://books.toscrape.com/catalogue/page-22.html)
2022-05-30 16:54:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-23.html> (referer: None)
2022-05-30 16:54:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-24.html> (referer: None)
2022-05-30 16:54:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/chernobyl-012340-the-incredible-true-story-of-the-worlds-worst-nuclear-disaster_560/index.html> (referer: https://books.toscrape.com/catalogue/page-23.html)
2022-05-30 16:54:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/roller-girl_540/index.html> (referer: https://books.toscrape.com/catalogue/page-24.html)
2022-05-30 16:54:25 [scrapy.extensions.logstats] INFO: Crawled 48 pages (at 48 pages/min), scraped 0 items (at 0 items/min)
2022-05-30 16:54:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-25.html> (referer: None)
2022-05-30 16:54:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-26.html> (referer: None)
2022-05-30 16:54:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/heaven-is-for-real-a-little-boys-astounding-story-of-his-trip-to-heaven-and-back_520/index.html> (referer: https://books.toscrape.com/catalogue/page-25.html)
2022-05-30 16:54:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-story-of-art_500/index.html> (referer: https://books.toscrape.com/catalogue/page-26.html)
2022-05-30 16:54:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-27.html> (referer: None)
2022-05-30 16:54:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-28.html> (referer: None)
2022-05-30 16:54:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/nightstruck-a-novel_480/index.html> (referer: https://books.toscrape.com/catalogue/page-27.html)
2022-05-30 16:54:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/benjamin-franklin-an-american-life_460/index.html> (referer: https://books.toscrape.com/catalogue/page-28.html)
2022-05-30 16:54:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-29.html> (referer: None)
2022-05-30 16:54:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-30.html> (referer: None)
2022-05-30 16:54:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-barefoot-contessa-cookbook_440/index.html> (referer: https://books.toscrape.com/catalogue/page-29.html)
2022-05-30 16:54:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/world-without-end-the-pillars-of-the-earth-2_420/index.html> (referer: https://books.toscrape.com/catalogue/page-30.html)
2022-05-30 16:54:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-31.html> (referer: None)
2022-05-30 16:54:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-dream-thieves-the-raven-cycle-2_400/index.html> (referer: https://books.toscrape.com/catalogue/page-31.html)

This looks better, but it is still not the expected strict order of requests (low priority 1, high priority 1, low priority 2, high priority 2, etc.).

When the downloader received the first response (...page1.html), the application asked the scheduler for the next request to send to the downloader. Because the first response (...page1.html) had not been parsed at that moment (and so had not yet produced a new high-priority request), it took the next request from the scheduler queue (the low-priority ...page2.html) and sent it to the server. Technically, the application still respects request priorities.

The key point is that a low-priority request moved from the scheduler queue to the downloader queue without waiting for the previously received low-priority response to be parsed (parsing is what produces the high-priority request we expect to be sent next). In this case (and likewise with a priority queue implemented for the downloader), we will not get a completely fixed/strict order of requests.

This happens because the default settings allow it: https://github.com/scrapy/scrapy/blob/afa5881ada816a2fc5555f6272dbfe87f7973222/scrapy/settings/default_settings.py#L263 This setting means that a request may be sent from the scheduler queue to the downloader queue as long as the total size of unparsed responses is less than SCRAPER_SLOT_MAX_ACTIVE_SIZE (~5 MB), so it is the direct reason for the non-strict order of requests.
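The effect of this limit can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Scrapy's engine: the simulate function, the fixed RESPONSE_SIZE of 1024 bytes, the page-N/book-N names, and the exact comparison used for the limit are all hypothetical. It models one download at a time, a priority-queue scheduler, and a rule that the next request may leave the scheduler only while the size of unparsed responses does not exceed the limit.

```python
import heapq
from collections import deque
from itertools import count

RESPONSE_SIZE = 1024  # assumed size of every response, in bytes


def simulate(max_active_size, pages=4):
    """Return the download order for a toy crawl with one request in flight."""
    seq = count()
    scheduler = []  # min-heap of (-priority, sequence number, name)

    def schedule(name, priority):
        heapq.heappush(scheduler, (-priority, next(seq), name))

    for i in range(1, pages + 1):
        schedule(f"page-{i}", 0)  # low-priority listing pages

    downloader = None   # the single request currently being downloaded
    unparsed = deque()  # responses received but not yet parsed
    active_size = 0     # total size of unparsed responses
    order = []

    def feed_downloader():
        nonlocal downloader
        # Take the next request only if the downloader is idle and the
        # amount of unparsed data does not exceed the limit.
        if downloader is None and scheduler and active_size <= max_active_size:
            downloader = heapq.heappop(scheduler)[2]

    while scheduler or downloader or unparsed:
        feed_downloader()
        if downloader is not None:
            # The download finishes; the response is now waiting to be parsed.
            order.append(downloader)
            unparsed.append(downloader)
            active_size += RESPONSE_SIZE
            downloader = None
            # The next request is asked for before the response is parsed.
            feed_downloader()
        if unparsed:
            name = unparsed.popleft()
            active_size -= RESPONSE_SIZE
            if name.startswith("page-"):
                # Parsing a listing page yields one high-priority book request.
                schedule("book-" + name.split("-")[1], 10)
    return order


print(simulate(max_active_size=5_000_000))  # roughly the ~5 MB default
print(simulate(max_active_size=0))
```

In this model, the large limit gives the paired pattern page-1, page-2, book-1, book-2, page-3, page-4, ..., while a limit of 0 forces each response to be parsed before the next request is taken, giving page-1, book-1, page-2, book-2, and so on.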

3. Custom settings (reduced scraper slot max active size) {"DOWNLOAD_DELAY":1, "CONCURRENT_REQUESTS":1, "CONCURRENT_REQUESTS_PER_DOMAIN":1, "SCRAPER_SLOT_MAX_ACTIVE_SIZE":0}

log output
2022-05-30 18:07:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-1.html> (referer: None)
2022-05-30 18:07:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html> (referer: https://books.toscrape.com/catalogue/page-1.html)
2022-05-30 18:07:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-2.html> (referer: None)
2022-05-30 18:07:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/in-her-wake_980/index.html> (referer: https://books.toscrape.com/catalogue/page-2.html)
2022-05-30 18:07:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-3.html> (referer: None)
2022-05-30 18:07:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/slow-states-of-collapse-poems_960/index.html> (referer: https://books.toscrape.com/catalogue/page-3.html)
2022-05-30 18:07:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-4.html> (referer: None)
2022-05-30 18:07:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-nameless-city-the-nameless-city-1_940/index.html> (referer: https://books.toscrape.com/catalogue/page-4.html)
2022-05-30 18:07:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-5.html> (referer: None)
2022-05-30 18:08:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/princess-jellyfish-2-in-1-omnibus-vol-01-princess-jellyfish-2-in-1-omnibus-1_920/index.html> (referer: https://books.toscrape.com/catalogue/page-5.html)
2022-05-30 18:08:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-6.html> (referer: None)
2022-05-30 18:08:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/immunity-how-elie-metchnikoff-changed-the-course-of-modern-medicine_900/index.html> (referer: https://books.toscrape.com/catalogue/page-6.html)
2022-05-30 18:08:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-7.html> (referer: None)
2022-05-30 18:08:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/algorithms-to-live-by-the-computer-science-of-human-decisions_880/index.html> (referer: https://books.toscrape.com/catalogue/page-7.html)
2022-05-30 18:08:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-8.html> (referer: None)
2022-05-30 18:08:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-shadow-hero-the-shadow-hero_860/index.html> (referer: https://books.toscrape.com/catalogue/page-8.html)
2022-05-30 18:08:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-9.html> (referer: None)
2022-05-30 18:08:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-bridge-to-consciousness-im-writing-the-bridge-between-science-and-our-old-and-new-beliefs_840/index.html> (referer: https://books.toscrape.com/catalogue/page-9.html)
2022-05-30 18:08:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-10.html> (referer: None)
2022-05-30 18:08:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/modern-romance_820/index.html> (referer: https://books.toscrape.com/catalogue/page-10.html)
2022-05-30 18:08:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-11.html> (referer: None)
2022-05-30 18:08:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/dark-notes_800/index.html> (referer: https://books.toscrape.com/catalogue/page-11.html)
2022-05-30 18:08:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-12.html> (referer: None)
2022-05-30 18:08:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/whole-lotta-creativity-going-on-60-fun-and-unusual-exercises-to-awaken-and-strengthen-your-creativity_780/index.html> (referer: https://books.toscrape.com/catalogue/page-12.html)
2022-05-30 18:08:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-13.html> (referer: None)
2022-05-30 18:08:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-power-of-habit-why-we-do-what-we-do-in-life-and-business_760/index.html> (referer: https://books.toscrape.com/catalogue/page-13.html)
2022-05-30 18:08:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-14.html> (referer: None)
2022-05-30 18:08:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/swell-a-year-of-waves_740/index.html> (referer: https://books.toscrape.com/catalogue/page-14.html)
2022-05-30 18:08:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-15.html> (referer: None)
2022-05-30 18:08:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/my-name-is-lucy-barton_720/index.html> (referer: https://books.toscrape.com/catalogue/page-15.html)
2022-05-30 18:08:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-16.html> (referer: None)
2022-05-30 18:08:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/hold-your-breath-search-and-rescue-1_700/index.html> (referer: https://books.toscrape.com/catalogue/page-16.html)
2022-05-30 18:08:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-17.html> (referer: None)
2022-05-30 18:08:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/code-name-verity-code-name-verity-1_680/index.html> (referer: https://books.toscrape.com/catalogue/page-17.html)
2022-05-30 18:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-18.html> (referer: None)
2022-05-30 18:08:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/all-the-light-we-cannot-see_660/index.html> (referer: https://books.toscrape.com/catalogue/page-18.html)
2022-05-30 18:08:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-19.html> (referer: None)
2022-05-30 18:08:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-midnight-watch-a-novel-of-the-titanic-and-the-californian_640/index.html> (referer: https://books.toscrape.com/catalogue/page-19.html)
2022-05-30 18:08:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-20.html> (referer: None)
2022-05-30 18:08:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/hide-away-eve-duncan-20_620/index.html> (referer: https://books.toscrape.com/catalogue/page-20.html)
2022-05-30 18:08:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-21.html> (referer: None)
2022-05-30 18:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/mothering-sunday_600/index.html> (referer: https://books.toscrape.com/catalogue/page-21.html)
2022-05-30 18:08:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-22.html> (referer: None)
2022-05-30 18:08:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/without-shame_580/index.html> (referer: https://books.toscrape.com/catalogue/page-22.html)
2022-05-30 18:08:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-23.html> (referer: None)
2022-05-30 18:08:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/chernobyl-012340-the-incredible-true-story-of-the-worlds-worst-nuclear-disaster_560/index.html> (referer: https://books.toscrape.com/catalogue/page-23.html)
2022-05-30 18:08:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-24.html> (referer: None)
2022-05-30 18:08:48 [scrapy.extensions.logstats] INFO: Crawled 47 pages (at 47 pages/min), scraped 0 items (at 0 items/min)
2022-05-30 18:08:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/roller-girl_540/index.html> (referer: https://books.toscrape.com/catalogue/page-24.html)
2022-05-30 18:08:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-25.html> (referer: None)
2022-05-30 18:08:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/heaven-is-for-real-a-little-boys-astounding-story-of-his-trip-to-heaven-and-back_520/index.html> (referer: https://books.toscrape.com/catalogue/page-25.html)
2022-05-30 18:08:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-26.html> (referer: None)
2022-05-30 18:08:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-story-of-art_500/index.html> (referer: https://books.toscrape.com/catalogue/page-26.html)
2022-05-30 18:08:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-27.html> (referer: None)
2022-05-30 18:08:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/nightstruck-a-novel_480/index.html> (referer: https://books.toscrape.com/catalogue/page-27.html)
2022-05-30 18:08:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-28.html> (referer: None)
2022-05-30 18:08:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/benjamin-franklin-an-american-life_460/index.html> (referer: https://books.toscrape.com/catalogue/page-28.html)
2022-05-30 18:08:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-29.html> (referer: None)
2022-05-30 18:08:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-barefoot-contessa-cookbook_440/index.html> (referer: https://books.toscrape.com/catalogue/page-29.html)
2022-05-30 18:09:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-30.html> (referer: None)
2022-05-30 18:09:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/world-without-end-the-pillars-of-the-earth-2_420/index.html> (referer: https://books.toscrape.com/catalogue/page-30.html)
2022-05-30 18:09:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/page-31.html> (referer: None)
2022-05-30 18:09:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/the-dream-thieves-the-raven-cycle-2_400/index.html> (referer: https://books.toscrape.com/catalogue/page-31.html)

With this configuration, setting SCRAPER_SLOT_MAX_ACTIVE_SIZE to 0, in addition to the other reduced concurrency settings, guarantees that the next request from the scheduler is moved to the downloader only after all received responses have been processed.

This configuration may run slower than with the default value of this setting (~5 MB), especially with low or zero values of DOWNLOAD_DELAY, but it allows more precise control over the order in which requests are sent and processed.
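As a rough sketch, the configuration described above could be written as a settings dictionary like the one below. Only `SCRAPER_SLOT_MAX_ACTIVE_SIZE = 0` and `DOWNLOAD_DELAY = 1` are taken from this thread; the two concurrency values of 1 are assumptions standing in for the "other reduced concurrency settings", and the name `STRICT_ORDER_SETTINGS` is hypothetical.

```python
# Sketch of a "strict ordering" configuration for a Scrapy spider.
# Values marked "assumption" are illustrative, not quoted from the thread.
STRICT_ORDER_SETTINGS = {
    "DOWNLOAD_DELAY": 1,                  # from the test script in this thread
    "CONCURRENT_REQUESTS": 1,             # assumption: one request in flight globally
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # assumption: one request per domain
    "SCRAPER_SLOT_MAX_ACTIVE_SIZE": 0,    # wait for responses to be processed first
}
```

It could be assigned to the spider's `custom_settings` attribute in place of `{"DOWNLOAD_DELAY": 1}`.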

0 reactions
Gallaecio commented, May 31, 2022

I am starting to think maybe we should not make any change code-wise here, and instead make sure the documentation explains clearly what @GeorgeA92 covered above.

On a related note: at the moment, the scheduler handles request feed order, and the downloader handles slots. But slots should be taken into account for proper request ordering, and so we end up with something like DownloaderAwarePriorityQueue for the scheduler. I wonder if we should move slot handling to the scheduler instead.
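To make the ordering problem concrete, here is a minimal, standalone sketch comparing a FIFO `collections.deque` with a `heapq`-based priority queue. It is not Scrapy's downloader or `DownloaderAwarePriorityQueue` code; the function names and the `(name, priority)` tuples are hypothetical, and the only assumption carried over from Scrapy is that a higher priority value means "process sooner".

```python
import heapq
from collections import deque
from itertools import count


def fifo_order(requests):
    """Return request names in the order a plain deque would release them."""
    queue = deque(requests)
    released = []
    while queue:
        name, _priority = queue.popleft()  # priority is ignored
        released.append(name)
    return released


def priority_order(requests):
    """Return request names in the order a priority queue would release them."""
    tie_breaker = count()  # keeps insertion order among equal priorities
    # heapq is a min-heap, so negate the priority to pop the highest first.
    heap = [(-priority, next(tie_breaker), name) for name, priority in requests]
    heapq.heapify(heap)
    released = []
    while heap:
        released.append(heapq.heappop(heap)[2])
    return released


# Hypothetical slot contents: two low-priority requests queued before a high one.
requests = [("low-1", 0), ("low-2", 0), ("high", 10)]
```

With the deque, `"high"` is released last; with the heap, it is released first while the low-priority requests keep their arrival order.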
