DFO works incorrectly
Description
According to the FAQ, Scrapy should crawl in DFO order, especially when lowering
CONCURRENT_REQUESTS to 1. But in fact, it doesn't.
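For reference, the FAQ's DFO claim comes from Scrapy's default scheduler queues, which are LIFO. The sketch below just spells out those defaults explicitly (setting names taken from the Scrapy settings documentation):

```python
# Defaults relevant to crawl order in Scrapy, shown explicitly for reference.
# LIFO scheduler queues give depth-first-like order; swapping in the FIFO
# queues (FifoMemoryQueue / PickleFifoDiskQueue) gives breadth-first order.
custom_settings = {
    'CONCURRENT_REQUESTS': 1,
    'DEPTH_PRIORITY': 0,                                         # default
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.LifoMemoryQueue',  # default
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleLifoDiskQueue',  # default
}
```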
Steps to Reproduce
Here is a minimal snippet:
from scrapy import Spider, Request


class ExampleSpider(Spider):
    name = 'example'
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        # 'DOWNLOAD_DELAY': 10,
    }

    def start_requests(self):
        yield Request('http://httpbin.org/get', callback=self.parseA)

    def parseA(self, response):
        for i in range(5):
            yield Request(f'http://httpbin.org/get?query=A{i}', callback=self.parseB)

    def parseB(self, response):
        source_query = response.json()['args']['query']
        target_query = source_query.replace('A', 'B')
        yield Request(f'http://httpbin.org/get?query={target_query}')

    def parse(self, response):
        # this would usually combine the items crawled from responses A and B
        return
Expected behavior:
It should crawl in the following DFO order:
A4
B4
A3
B3
A2
B2
A1
B1
A0
B0
Actual behavior:
A4
A3
B4
B3
A2
A1
B2
B1
A0
B0
Reproduces how often:
It always reproduces the same order, even though I tried a higher DOWNLOAD_DELAY
hoping to delay the A requests.
Versions
2.4.1
Additional context
These are the reasons why the order is so important:

- When crawling an API, it is necessary to behave like a human being (e.g. enter page A, and then enter page B from page A), so the order matters.
- What's more, some endpoints get overloaded more easily than others, so you need to crawl the endpoints alternately to balance the downloading time.

Maybe you would say that I can yield requests A and B in the same parse method to crawl the endpoints alternately. But sometimes we need to combine the results from different endpoints, so we have to yield request A in the parseA method and request B in the parseB method (because there is no inline-request feature), and finally combine the parsed results (through the cb_kwargs field) in the parse method.
I have tried a downloader middleware to reschedule any request (by returning it) that targets the same endpoint as the last one seen, but it had no effect.
After a lot of work, I couldn’t find a way to re-order the crawling. Shall we have a bug fix for this?
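One possible workaround is to chain the requests so that each callback yields at most one follow-up request: parseA yields its B request, and parseB yields the next A request. The sketch below is a toy, network-free model of that chaining under a one-at-a-time LIFO scheduler, not real Scrapy code; in an actual spider it would correspond to yielding a single A request from start_requests and scheduling the next A from parseB, carrying partial results via cb_kwargs.

```python
# Toy model: a LIFO scheduler with one request in flight at a time. Because
# each callback schedules at most one follow-up request, strict A/B
# interleaving is preserved.

def crawl(start, callbacks):
    stack = [start]                  # LIFO scheduler queue
    order = []                       # download order we observe
    while stack:
        req = stack.pop()            # one request in flight at a time
        order.append(req)
        # The callback of the finished response schedules new requests.
        stack.extend(callbacks.get(req, lambda: [])())
    return order

# parse_a(i) yields B{i}; parse_b(i) combines results, then yields A{i-1}.
callbacks = {'seed': lambda: ['A4']}
for i in range(5):
    callbacks[f'A{i}'] = (lambda i=i: [f'B{i}'])
    callbacks[f'B{i}'] = (lambda i=i: ([f'A{i - 1}'] if i > 0 else []))

print(crawl('seed', callbacks))
# seed, A4, B4, A3, B3, A2, B2, A1, B1, A0, B0
```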
Issue Analytics
- Created 2 years ago
- Comments: 9 (5 by maintainers)
Top GitHub Comments
@codingpy Your case is nearly identical to this Stack Overflow question.
When Scrapy receives a response, it schedules the next request for execution (before the request.callback of that response is executed). Your application scheduled request A3 after A4 because at that moment the scheduler held only requests A0, A1, A2, A3. Your application didn't know anything about request B4 because it had not yet reached and executed the parseB callback function received from the first A4 request (as a result, request B4 was not in the scheduler queue yet at that moment).

I don't think that this is a bug. I'd say that it is a non-obvious consequence of the asynchronous nature of Scrapy.
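This scheduling can be sketched with a toy model: a LIFO scheduler plus the detail that the engine pops the next request before the finished response's callback runs. This is an illustrative simulation of the behaviour, not Scrapy's actual engine code; the request names mirror the example spider.

```python
# Toy model of Scrapy's engine loop with a LIFO scheduler. The key detail is
# that the next request is popped for download BEFORE the callback of the
# just-finished response runs, so requests yielded by that callback arrive
# in the scheduler too late to be picked first.

def crawl(start_request, callbacks):
    stack = [start_request]          # LIFO scheduler queue
    order = []                       # download order we observe
    in_flight = stack.pop()          # CONCURRENT_REQUESTS = 1
    while in_flight is not None:
        order.append(in_flight)
        # The engine pops the next request before the callback runs...
        next_request = stack.pop() if stack else None
        # ...then the callback of the finished response schedules new requests.
        for new in callbacks.get(in_flight, lambda: [])():
            stack.append(new)
        if next_request is None and stack:
            next_request = stack.pop()   # pick up late-scheduled requests
        in_flight = next_request
    return order

callbacks = {
    'seed': lambda: [f'A{i}' for i in range(5)],            # parseA yields A0..A4
    **{f'A{i}': (lambda i=i: [f'B{i}']) for i in range(5)},  # parseB chains B{i}
}
print(crawl('seed', callbacks))
# seed, A4, A3, B4, B3, A2, A1, B2, B1, A0, B0
```

Running it reproduces the order reported in the issue: A4 and A3 are downloaded before B4 ever enters the scheduler queue.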
Questions like "Scrapy doesn't .... request order" regularly appear on Stack Overflow under the [scrapy] tag.

If I understand correctly, it was decided that this is not a bug but an implementation detail, but please reopen if I'm wrong.