DFO works incorrectly

Description

According to the FAQ, Scrapy should crawl in DFO (depth-first order), especially when CONCURRENT_REQUESTS is lowered to 1. In practice, it does not.

Steps to Reproduce

Here is a snippet that reproduces the problem:

from scrapy import Spider, Request


class ExampleSpider(Spider):
    name = 'example'

    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
        # 'DOWNLOAD_DELAY': 10,
    }

    def start_requests(self):
        yield Request('http://httpbin.org/get', callback=self.parseA)

    def parseA(self, response):
        for i in range(5):
            yield Request(f'http://httpbin.org/get?query=A{i}', callback=self.parseB)

    def parseB(self, response):
        source_query = response.json()['args']['query']

        target_query = source_query.replace('A', 'B')

        yield Request(f'http://httpbin.org/get?query={target_query}')

    def parse(self, response):
        # the items crawled from responses A and B are usually combined here

        return

Expected behavior:

It should crawl in the following DFO order:

A4
B4
A3
B3
A2
B2
A1
B1
A0
B0

Actual behavior:

A4
A3
B4
B3
A2
A1
B2
B1
A0
B0

Reproduces how often:

It always reproduces the same order, even when I set a higher DOWNLOAD_DELAY in the hope of delaying the A requests.

Versions

2.4.1

Additional context

These are the reasons why the order is so important:

  • When crawling an API, it is necessary to behave like a human being (e.g. enter page A and then enter page B from page A), so the order matters.

  • What’s more, some endpoints get overloaded more easily than others, so you need to crawl the endpoints alternately to balance the download time.

You might say that I can yield requests A and B in the same parse method to crawl the endpoints alternately. But sometimes we need to combine the results from different endpoints, so we have to yield request A in the parseA method and request B in the parseB method (because there is no inline request feature), and finally combine the parsed results (passed along through the cb_kwargs field) in the parse method.

I have tried using a downloader middleware to reschedule a request (by returning it) when it targets the same endpoint as the last one seen, but it had no effect.

After a lot of work, I couldn’t find a way to reorder the crawl. Could this be fixed as a bug?

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

GeorgeA92 commented, May 27, 2021

@codingpy Your case is… nearly identical to this stackoverflow question

When Scrapy receives a response, it schedules the next request for execution before the callback of the request that produced the response is executed. Your application sent request A3 after A4 because at that moment the scheduler held only requests A0, A1, A2 and A3. It knew nothing about request B4, because the parseB callback for the A4 response had not yet been reached and executed, so B4 was not in the scheduler queue at that moment.

I don’t think that this is a bug. I would say that this is a non-obvious consequence of the asynchronous nature of Scrapy.

Questions like "Scrapy doesn't.... request order" regularly appear under the [scrapy] tag on Stack Overflow.
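
The explanation above can be illustrated with a toy model. This is only a sketch and not Scrapy's actual engine: it assumes a single download slot, a LIFO stack of pending requests, and an engine that takes the next request off the stack before it runs the callback of the response that just arrived. The names `callback`, `simulate` and `prefetch` are hypothetical and exist only for this illustration.

```python
def callback(name):
    """Return the requests a callback would yield for a finished request."""
    if name == "start":
        # parseA: five requests to endpoint A
        return [f"A{i}" for i in range(5)]
    if name.startswith("A"):
        # parseB: one request to endpoint B with the same index
        return ["B" + name[1:]]
    # parse: yields no further requests
    return []


def simulate(prefetch):
    """Return the order in which requests finish downloading.

    prefetch=True models the behavior described in the comment: the next
    request leaves the LIFO stack before the callback runs.
    prefetch=False models the order the reporter expected.
    """
    pending = []          # LIFO stack of scheduled requests
    order = []
    in_flight = "start"   # the single request being downloaded
    while in_flight is not None:
        finished = in_flight
        if finished != "start":
            order.append(finished)
        if prefetch:
            # Pick the next request first, then run the callback.
            nxt = pending.pop() if pending else None
            pending.extend(callback(finished))
            if nxt is None and pending:
                nxt = pending.pop()
        else:
            # Run the callback first, then pick the next request.
            pending.extend(callback(finished))
            nxt = pending.pop() if pending else None
        in_flight = nxt
    return order


print(" ".join(simulate(prefetch=True)))
# A4 A3 B4 B3 A2 A1 B2 B1 A0 B0
print(" ".join(simulate(prefetch=False)))
# A4 B4 A3 B3 A2 B2 A1 B1 A0 B0
```

Under these assumptions, the first line matches the order reported in the issue and the second matches the order the reporter expected, which suggests the interleaving comes from when the next request is picked rather than from the queue type.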

wRAR commented, Jun 11, 2021

If I understand correctly, it was decided that this is not a bug but an implementation detail, but please reopen if I’m wrong.
