
CONCURRENT_REQUESTS_PER_DOMAIN ignored for start_urls


Description

When DOWNLOAD_DELAY is set to a value > 0, the value of CONCURRENT_REQUESTS_PER_DOMAIN is ignored when processing start_urls.

Steps to Reproduce

  1. Create an example spider:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['google.com']
    start_urls = [f'https://www.google.com/search?q=scrapy{x}' for x in range(0, 20, 1)]

    def parse(self, response):
        self.log("*" * 100)

  2. In settings.py:

CONCURRENT_REQUESTS = 100
DOWNLOAD_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 20

  3. Execute scrapy crawl example (a self-contained variant of this reproduction is sketched after these steps).
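
A self-contained variant of the reproduction, using Scrapy’s custom_settings class attribute instead of settings.py and logging elapsed time so the spacing between responses is visible in the output (the elapsed-time bookkeeping is purely illustrative):

import time

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['google.com']
    start_urls = [f'https://www.google.com/search?q=scrapy{x}' for x in range(20)]

    # Same values as the settings.py above, applied per spider.
    custom_settings = {
        'CONCURRENT_REQUESTS': 100,
        'DOWNLOAD_DELAY': 10,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 20,
    }

    start_time = time.time()  # roughly when the crawl process starts

    def parse(self, response):
        # With the behaviour described in this report, these lines appear
        # roughly 10 seconds apart rather than all at once.
        self.log(f"{response.url} after {time.time() - self.start_time:.1f}s")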

Expected behavior: all 20 requests should be crawled without delay.
Actual behavior: the spider crawls a page every 10 seconds.

Reproduces how often: 100%

Versions

2.4.1, 2.5.0

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
Gallaecio commented, Apr 21, 2021

After looking into it, I’ve identified that the problem is in https://github.com/scrapy/scrapy/blob/0e579182319504ba7bfd0c09333fe92f70c6d312/scrapy/core/downloader/__init__.py#L135-L147: slot.lastseen is a single value, where maybe it should be a deque that rotates with the enqueued requests, or even be part of the queue.
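
To make the effect of a single lastseen timestamp concrete, a toy model of that check might look like the following; this is not Scrapy’s actual code, and the Slot class with its can_send and send methods is invented purely for illustration:

import time


class Slot:
    def __init__(self, concurrency, delay):
        self.concurrency = concurrency  # e.g. CONCURRENT_REQUESTS_PER_DOMAIN
        self.delay = delay              # e.g. DOWNLOAD_DELAY
        self.active = set()             # requests currently in flight
        self.lastseen = 0.0             # time the last request was sent

    def can_send(self, now):
        # Both conditions must hold; with delay > 0 the lastseen check
        # dominates, so requests leave one per `delay` seconds.
        free_slot = len(self.active) < self.concurrency
        delay_elapsed = (now - self.lastseen) >= self.delay
        return free_slot and delay_elapsed

    def send(self, request, now):
        self.lastseen = now
        self.active.add(request)


slot = Slot(concurrency=20, delay=10)
now = time.time()
sent = 0
for request in range(20):
    if slot.can_send(now):
        slot.send(request, now)
        sent += 1
print(sent)  # 1 -- only one request leaves immediately, despite concurrency=20

Because lastseen is updated on every dispatch, the delay check blocks the next request for the full DOWNLOAD_DELAY no matter how many concurrency slots are free, which matches the one-request-every-10-seconds behavior reported above.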

However, as I was trying to think of a way to make such a change in a backward-compatible way, which I don’t think is straightforward, I started to think that maybe there is no bug here, and the current implementation is how it should be. DOWNLOAD_DELAY is the delay between requests to the same domain, while your expectation is for DOWNLOAD_DELAY to be the delay between each batch of CONCURRENT_REQUESTS_PER_DOMAIN requests, and I think the former interpretation is much more useful.

The current implementation will make requests with 10 seconds of delay between them, but will also stop sending requests if the server responses are so slow that, even with that delay, by the time the 21st request is due none of the responses have arrived from the server. This may sound crazy with such high numbers, but with lower values for both settings it makes a lot of sense.

What you are suggesting would result in batches of requests being sent at the same time. This is not good for servers. Your requests should be distributed in time, rather than sending N requests and then waiting M seconds before sending another batch of N requests simultaneously.

In summary, I don’t think this is a bug, and CONCURRENT_REQUESTS_PER_DOMAIN is not ignored; it just does not work as you expected, and its function will likely not come into play in your example because Google will send responses very quickly.

If you can justify the need for the behavior you expect, maybe we can turn this into an enhancement request to make such a behavior possible. But I honestly cannot think of a reason why you would want to do that when scraping someone else’s server.
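
For reference, if the goal is simply to have all 20 start requests dispatched right away, one configuration consistent with the explanation above is to drop the fixed delay and let the concurrency caps do the throttling. A minimal settings.py sketch with illustrative values (be considerate towards the target server):

# settings.py
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 20  # caps simultaneous requests per domain
DOWNLOAD_DELAY = 0                   # no fixed per-request delay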

0 reactions
Gallaecio commented, Jun 20, 2022

We should make sure https://github.com/scrapy/scrapy/issues/5083#issuecomment-824100114 is clear in the documentation of DOWNLOAD_DELAY; this is bound to be a common misunderstanding.
