Support limiting the number of requests per interval

Many websites’ open APIs limit the maximum number of requests from an IP address in a given interval, for example 40 requests per minute. However, the current settings are CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS_PER_IP, and DOWNLOAD_DELAY. Their effect depends on how long requests take to complete, so I find it difficult to tune them to an API’s threshold. To achieve high performance without exceeding the API’s threshold, I suggest adding a setting like MAX_REQUESTS_PER_MINUTE.

Thanks!

Issue Analytics

Created 11 years ago
Comments: 6 (3 by maintainers)

Top GitHub Comments

3 reactions

eLRuLL commented, Sep 22, 2016

Hi @jmaynier, I was able to solve this using a combination of CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY and download_slot.

To summarise, Scrapy uses the domain of a URL as the “key” to create a download slot (slots are in charge of download concurrency). That’s why we can set CONCURRENT_REQUESTS_PER_DOMAIN: Scrapy only controls requests per slot.

You can also create your own slots to set up custom concurrency and assign Request objects to them. To do that, pass the slot name in the meta parameter, like this: Request(url, meta={'download_slot': 'mycustomslot'}). If you want more requests to be controlled by the same slot, keep passing that meta parameter.
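To make the slot idea concrete, here is a minimal sketch of how a slot key could be resolved. It is a simplified toy model, not Scrapy’s actual code: the function name resolve_slot_key and the example URLs are made up for illustration, and it only assumes what is described above (an explicit download_slot in meta wins, otherwise the URL’s domain is the key).

```python
from urllib.parse import urlparse


def resolve_slot_key(url, meta=None):
    """Toy model: an explicit 'download_slot' wins, else use the URL's hostname."""
    meta = meta or {}
    if "download_slot" in meta:
        return meta["download_slot"]
    # No custom slot given: fall back to the domain of the URL.
    return urlparse(url).hostname or ""


# Two requests to the same site can land in different slots.
default_key = resolve_slot_key("https://api.example.com/items?page=1")
custom_key = resolve_slot_key(
    "https://api.example.com/items?page=2",
    meta={"download_slot": "mycustomslot"},
)
print(default_key)
print(custom_key)
```

Every request that carries the same download_slot value shares one slot, and therefore one set of concurrency and delay limits.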

Requests assigned to your custom slot will still be controlled by the CONCURRENT_REQUESTS_PER_DOMAIN setting, even though the slot isn’t “actually” a domain.

What I did to control concurrency per “credential” was to emulate the “domain” or “slot” behaviour per credential, which resulted in the following settings:

settings.py

CONCURRENT_REQUESTS = 200  # a high number, just so it won't conflict with per-domain concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # do 1 request at a time per domain (credentials will act as domains)
RANDOMIZE_DOWNLOAD_DELAY = False  # deactivate the random offset that Scrapy adds to the delay
DOWNLOAD_DELAY = 1.0  # the delay you want per credential: one request every 1 second; decimals are allowed
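For a limit stated as “N requests per interval”, such as the 40 requests per minute in the original question, the delay can be derived with simple arithmetic. The sketch below assumes that requests in one slot are spaced at least DOWNLOAD_DELAY seconds apart and that the random offset is disabled; the helper name delay_for_rate is hypothetical and not part of Scrapy.

```python
def delay_for_rate(max_requests, interval_seconds=60.0):
    """Return the fixed delay (in seconds) that keeps a slot at or under the limit."""
    if max_requests <= 0:
        raise ValueError("max_requests must be positive")
    return interval_seconds / max_requests


# 40 requests per minute -> one request every 1.5 seconds per slot.
download_delay = delay_for_rate(40)

# Values that could be copied into settings.py (setting names as in the answer above).
suggested_settings = {
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    "RANDOMIZE_DOWNLOAD_DELAY": False,
    "DOWNLOAD_DELAY": download_delay,
}
print(suggested_settings)
```

Note that this caps the rate per slot; with several slots (for example one per credential), each slot gets its own budget.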

Now, when making requests with your credentials, put a unique identifier per credential (you could keep the credentials in a list and use the list index) into the download_slot meta parameter, and keep passing it on all the requests you want to make with that credential. Scrapy will take care of the concurrency of the requests per credential.
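One possible way to build the meta for each request is sketched below. The credential values, the URLs, the slot naming scheme ("credential-<index>") and the extra credential_index meta key are all assumptions made for illustration; only the download_slot key comes from the explanation above.

```python
# Hypothetical credentials; in a real project these would come from configuration.
CREDENTIALS = ["key-a", "key-b", "key-c"]


def meta_for_credential(index):
    """Build request meta that pins a request to the slot of one credential."""
    return {
        "download_slot": f"credential-{index}",  # one slot per credential
        "credential_index": index,  # custom key so later code knows which credential to use
    }


def plan_requests(urls):
    """Spread URLs over the credentials round-robin and attach the matching meta."""
    planned = []
    for position, url in enumerate(urls):
        index = position % len(CREDENTIALS)
        planned.append((url, meta_for_credential(index)))
    return planned


urls = [f"https://api.example.com/items?page={page}" for page in range(1, 5)]
plan = plan_requests(urls)
for url, meta in plan:
    print(url, meta["download_slot"])
```

Inside a spider, each pair would then be turned into a request with Request(url, meta=meta), as in the answer above.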

NOTE: If you still need to change something in the request before Scrapy actually executes it (downloads it from the site), use a downloader middleware and modify the request in its process_request method.
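A rough sketch of such a middleware follows. The class name, the API_CREDENTIALS setting, the credential_index meta key and the Bearer header format are all hypothetical choices for this example; it assumes the spider stores the index of the credential in request.meta. The demo at the bottom uses a stand-in request object so the sketch runs without Scrapy installed.

```python
from types import SimpleNamespace


class CredentialHeaderMiddleware:
    """Hypothetical downloader middleware that adds an auth header per credential."""

    def __init__(self, credentials):
        self.credentials = credentials

    @classmethod
    def from_crawler(cls, crawler):
        # API_CREDENTIALS is an assumed custom setting, not a built-in one.
        return cls(crawler.settings.getlist("API_CREDENTIALS"))

    def process_request(self, request, spider):
        index = request.meta.get("credential_index")
        if index is not None:
            # Modify the request in place just before it is downloaded.
            request.headers["Authorization"] = f"Bearer {self.credentials[index]}"
        return None  # None lets the request continue through the download chain


# Stand-in for a Scrapy request: only the attributes the middleware touches.
fake_request = SimpleNamespace(
    meta={"download_slot": "credential-1", "credential_index": 1},
    headers={},
)
middleware = CredentialHeaderMiddleware(["key-a", "key-b"])
result = middleware.process_request(fake_request, spider=None)
print(fake_request.headers)
```

To take effect in a real project, the class would also need to be registered in the DOWNLOADER_MIDDLEWARES setting.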

0 reactions

jmaynier commented, Sep 22, 2016

@eLRuLL thanks for the detailed explanation!