DFO works incorrectly
Description
According to the FAQ, Scrapy should crawl in DFO order, especially when lowering
CONCURRENT_REQUESTS to 1. But in fact, it doesn't.
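For reference, the FAQ's DFO claim comes from Scrapy's default scheduler queues, which are LIFO. The sketch below just spells out those defaults explicitly (setting names taken from the Scrapy settings documentation):

```python
# Defaults relevant to crawl order in Scrapy, shown explicitly for reference.
# LIFO scheduler queues give depth-first-like order; swapping in the FIFO
# queues (FifoMemoryQueue / PickleFifoDiskQueue) gives breadth-first order.
custom_settings = {
    'CONCURRENT_REQUESTS': 1,
    'DEPTH_PRIORITY': 0,                                         # default
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.LifoMemoryQueue',  # default
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleLifoDiskQueue',  # default
}
```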
Steps to Reproduce
Here is a minimal snippet:
from scrapy import Spider, Request


class ExampleSpider(Spider):
    name = 'example'
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        # 'DOWNLOAD_DELAY': 10,
    }

    def start_requests(self):
        yield Request('http://httpbin.org/get', callback=self.parseA)

    def parseA(self, response):
        for i in range(5):
            yield Request(f'http://httpbin.org/get?query=A{i}', callback=self.parseB)

    def parseB(self, response):
        source_query = response.json()['args']['query']
        target_query = source_query.replace('A', 'B')
        yield Request(f'http://httpbin.org/get?query={target_query}')

    def parse(self, response):
        # this would usually combine the items crawled from responses A and B
        return
Expected behavior:
It should crawl in the following DFO order:
A4
B4
A3
B3
A2
B2
A1
B1
A0
B0
Actual behavior:
A4
A3
B4
B3
A2
A1
B2
B1
A0
B0
Reproduces how often:
It always reproduces the same order, even though I tried a higher DOWNLOAD_DELAY
hoping to delay the A requests.
Versions
2.4.1
Additional context
These are the reasons why the order is so important:

- When crawling an API, it is necessary to behave like a human being (e.g. enter page A, and then enter page B from page A), so the order matters.
- What's more, some endpoints get overloaded more easily than others, so you need to crawl the endpoints alternately to balance the downloading time.

Maybe you would say that I can yield requests A and B in the same parse method to crawl the endpoints alternately. But sometimes we need to combine the results from different endpoints, so we have to yield request A in the parseA method and request B in the parseB method (because there is no inline-request feature), and finally combine the parsed results (through the cb_kwargs field) in the parse method.
I have tried a downloader middleware to reschedule any request (by returning it) that targets the same endpoint as the last one seen, but it had no effect.
After a lot of work, I couldn’t find a way to re-order the crawling. Shall we have a bug fix for this?
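One possible workaround is to chain the requests so that each callback yields at most one follow-up request: parseA yields its B request, and parseB yields the next A request. The sketch below is a toy, network-free model of that chaining under a one-at-a-time LIFO scheduler, not real Scrapy code; in an actual spider it would correspond to yielding a single A request from start_requests and scheduling the next A from parseB, carrying partial results via cb_kwargs.

```python
# Toy model: a LIFO scheduler with one request in flight at a time. Because
# each callback schedules at most one follow-up request, strict A/B
# interleaving is preserved.

def crawl(start, callbacks):
    stack = [start]                  # LIFO scheduler queue
    order = []                       # download order we observe
    while stack:
        req = stack.pop()            # one request in flight at a time
        order.append(req)
        # The callback of the finished response schedules new requests.
        stack.extend(callbacks.get(req, lambda: [])())
    return order

# parse_a(i) yields B{i}; parse_b(i) combines results, then yields A{i-1}.
callbacks = {'seed': lambda: ['A4']}
for i in range(5):
    callbacks[f'A{i}'] = (lambda i=i: [f'B{i}'])
    callbacks[f'B{i}'] = (lambda i=i: ([f'A{i - 1}'] if i > 0 else []))

print(crawl('seed', callbacks))
# seed, A4, B4, A3, B3, A2, B2, A1, B1, A0, B0
```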
Issue Analytics
- Created 2 years ago
- Comments: 9 (5 by maintainers)
Top GitHub Comments
@codingpy Your case is nearly identical to this Stack Overflow question.
When Scrapy receives a response, it schedules the next request for execution (before the request.callback of that response is executed). Your application scheduled request A3 after A4 because at that moment the scheduler held only requests A0, A1, A2, A3. Your application didn't know anything about request B4 because it had not yet reached and executed the parseB callback function received from the first A4 request (as a result, request B4 was not in the scheduler queue yet at that moment).

I don't think that this is a bug. I'd say that it is a non-obvious consequence of the asynchronous nature of Scrapy.
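This scheduling can be sketched with a toy model: a LIFO scheduler plus the detail that the engine pops the next request before the finished response's callback runs. This is an illustrative simulation of the behaviour, not Scrapy's actual engine code; the request names mirror the example spider.

```python
# Toy model of Scrapy's engine loop with a LIFO scheduler. The key detail is
# that the next request is popped for download BEFORE the callback of the
# just-finished response runs, so requests yielded by that callback arrive
# in the scheduler too late to be picked first.

def crawl(start_request, callbacks):
    stack = [start_request]          # LIFO scheduler queue
    order = []                       # download order we observe
    in_flight = stack.pop()          # CONCURRENT_REQUESTS = 1
    while in_flight is not None:
        order.append(in_flight)
        # The engine pops the next request before the callback runs...
        next_request = stack.pop() if stack else None
        # ...then the callback of the finished response schedules new requests.
        for new in callbacks.get(in_flight, lambda: [])():
            stack.append(new)
        if next_request is None and stack:
            next_request = stack.pop()   # pick up late-scheduled requests
        in_flight = next_request
    return order

callbacks = {
    'seed': lambda: [f'A{i}' for i in range(5)],            # parseA yields A0..A4
    **{f'A{i}': (lambda i=i: [f'B{i}']) for i in range(5)},  # parseB chains B{i}
}
print(crawl('seed', callbacks))
# seed, A4, A3, B4, B3, A2, A1, B2, B1, A0, B0
```

Running it reproduces the order reported in the issue: A4 and A3 are downloaded before B4 ever enters the scheduler queue.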
Questions like "Scrapy doesn't .... request order" regularly appear on Stack Overflow under the [scrapy] tag.

If I understand correctly, it was decided that this is not a bug but an implementation detail, but please reopen if I'm wrong.