Support limiting the number of requests per interval

Many websites' open APIs limit the maximum number of requests an IP address can make within a given interval, for example 40 requests per minute. However, the current settings are CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS_PER_IP, and DOWNLOAD_DELAY. Their effect depends on how long requests take to complete, so I find it difficult to tune them to an API's threshold. To get high performance without exceeding the API's threshold, I suggest adding a setting like MAX_REQUESTS_PER_MINUTE.
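To make the "depends on the duration of completing requests" point concrete, here is a rough back-of-the-envelope model. It is not part of Scrapy: the function name and the sample latencies are made up, and it assumes a steady state where only a concurrency cap applies (no DOWNLOAD_DELAY).

```python
def requests_per_minute(concurrency: int, seconds_per_request: float) -> float:
    """Rough steady-state throughput when only a concurrency cap applies.

    Each of the `concurrency` in-flight slots finishes one request every
    `seconds_per_request` seconds, so the rate scales with 1 / latency.
    """
    return concurrency * 60.0 / seconds_per_request


if __name__ == "__main__":
    # Same concurrency setting, very different request rates as latency changes.
    for latency in (0.2, 1.0, 3.0):
        rate = requests_per_minute(8, latency)
        print(f"latency={latency}s -> about {rate:.0f} requests/minute")
```

With a fixed concurrency, a faster server means more requests per minute, which is why a concurrency cap alone cannot guarantee something like "40 requests per minute".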

Thanks!

Issue Analytics

  • State: closed
  • Created 11 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

eLRuLL commented, Sep 22, 2016

Hi @jmaynier, I was able to solve this using a combination of CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY and download_slot.

To summarise, Scrapy uses the domain of a URL as the “key” to create a download slot (slots are what control download concurrency). That is why CONCURRENT_REQUESTS_PER_DOMAIN works: Scrapy only limits requests per slot.

You can also create your own slots to set up custom concurrency and assign Request objects to them. To do so, pass the slot name in the meta parameter, like this: Request(url, meta={'download_slot': 'mycustomslot'}). To have more requests controlled by the same slot, keep passing that same meta value.

Requests assigned to your custom slot are still controlled by the CONCURRENT_REQUESTS_PER_DOMAIN setting, even though the slot isn’t “actually” a domain.
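The slot selection described above can be pictured with a small stand-alone model. This is a simplified sketch, not Scrapy's actual source: slot_key is a hypothetical helper, and it ignores details such as per-IP slots.

```python
from urllib.parse import urlparse


def slot_key(url: str, meta: dict) -> str:
    """Simplified model of how a request is mapped to a download slot."""
    # An explicit download_slot in meta wins over everything else.
    if "download_slot" in meta:
        return meta["download_slot"]
    # Otherwise the URL's hostname acts as the slot key.
    return urlparse(url).hostname or ""


if __name__ == "__main__":
    url = "https://api.example.com/items?page=1"
    print(slot_key(url, {}))
    print(slot_key(url, {"download_slot": "mycustomslot"}))
```

Two requests to the same host therefore share a slot by default, while a custom download_slot value lets you group (or split) requests however you like.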

To control concurrency per “credential”, I emulated the “domain” or “slot” behaviour per credential, which came down to the following settings:

settings.py

CONCURRENT_REQUESTS = 200  # a high number, just so it won't conflict with per-domain concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # do 1 request at a time per domain (credentials will act as domains)
RANDOMIZE_DOWNLOAD_DELAY = False  # deactivate the random offset that Scrapy adds to the delay
DOWNLOAD_DELAY = 1.0  # the delay you want per credential, in seconds; decimals are allowed
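To pick a DOWNLOAD_DELAY for a limit such as the "40 requests per minute" from the original question, one possible calculation is sketched below. The helper name and the margin parameter are assumptions for illustration. It relies on the settings above: one request at a time per slot and no randomised delay, so requests in a slot start at least DOWNLOAD_DELAY seconds apart.

```python
def download_delay_for(max_requests: int, interval_seconds: float,
                       margin: float = 0.0) -> float:
    """Seconds to wait between request starts so one slot stays within a limit.

    `margin` is an optional safety factor (0.1 means 10% slower than the
    theoretical minimum), useful if the API counts its windows strictly.
    """
    return interval_seconds / max_requests * (1.0 + margin)


if __name__ == "__main__":
    # 40 requests per 60 seconds, as in the original question.
    print(download_delay_for(40, 60))
    # The same limit with a 10% safety margin.
    print(download_delay_for(40, 60, margin=0.1))
```

The result would go into DOWNLOAD_DELAY in settings.py. Since the delay applies per slot, each credential's slot gets its own budget.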

When making requests with your credentials, put a unique identifier per credential (for example, keep the credentials in a list and use the list index) into the download_slot meta parameter, and keep passing it on every request made with that credential. Scrapy will then handle the concurrency of requests per credential.
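A minimal sketch of the list-index idea follows. The helper name, the "credential-N" slot naming and the round-robin distribution are all assumptions made for illustration. The thread only says to use a unique identifier per credential.

```python
from itertools import cycle


def assign_slots(urls, credentials):
    """Yield (url, credential, meta), with one download_slot per credential.

    URLs are spread over the credentials round-robin, and the credential's
    list index is used to build the slot name.
    """
    indexed = cycle(enumerate(credentials))
    for url, (index, credential) in zip(urls, indexed):
        yield url, credential, {"download_slot": f"credential-{index}"}


if __name__ == "__main__":
    urls = [f"https://api.example.com/items/{i}" for i in range(5)]
    for url, credential, meta in assign_slots(urls, ["keyA", "keyB"]):
        # In a spider this would become: scrapy.Request(url, meta=meta, ...)
        print(url, credential, meta)
```

Inside a spider, each yielded meta dict would be passed to Request(url, meta=meta), so all requests made with the same credential land in the same slot.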

NOTE: If you still need to change something in the request before Scrapy actually executes it (downloads it from the site), use a downloader middleware and modify the request in its process_request method.
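As a hedged illustration of that note, a downloader middleware could attach the credential that matches the request's slot. The class name, the SLOT_API_KEYS setting and the "Bearer" header scheme are hypothetical. Only process_request, from_crawler, request.meta and request.headers are Scrapy's documented hooks and attributes.

```python
class CredentialHeaderMiddleware:
    """Hypothetical downloader middleware: adds an API key based on the slot."""

    def __init__(self, api_keys):
        # Maps a download_slot name to the key to send (illustrative data).
        self.api_keys = api_keys

    @classmethod
    def from_crawler(cls, crawler):
        # SLOT_API_KEYS is an assumed custom setting, not a built-in one.
        return cls(crawler.settings.getdict("SLOT_API_KEYS"))

    def process_request(self, request, spider):
        slot = request.meta.get("download_slot")
        key = self.api_keys.get(slot)
        if key is not None:
            # Modify the request just before it is downloaded.
            request.headers["Authorization"] = f"Bearer {key}"
        # Returning None lets Scrapy continue handling the request normally.
        return None
```

To try it, the class would be registered under DOWNLOADER_MIDDLEWARES in settings.py with a module path and priority of your choosing.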

jmaynier commented, Sep 22, 2016

@eLRuLL thanks for the detailed explanation!
