[bug?] while True in start_requests(self) makes Scrapy unable to consume the yields

I’m doing

    def start_requests(self):
        while 1:
            words = read_a_list_wanna_crawl()
            ips = get_a_ip_list()
            if words.count() > 0:
                for word, ip in zip(words, ips):
                    print('do while')
                    # processed_url is presumably derived from word (not shown in the original post)
                    yield scrapy.Request(processed_url, self.html_parse, meta={'proxy': ip, ...})

But when len(zip(words, ips)) == 1, Scrapy prints "do while" forever (an infinite loop) and never downloads any requests. If len(zip(words, ips)) > 1, Scrapy does not go into an infinite loop.

Is this a bug? Can Scrapy handle this?

PS (another way to solve this): is it possible to create a fake scrapy.Request() that doesn't make a request but still runs the callback, to finish this kind of control flow in Scrapy?

Issue Analytics

Created: 5 years ago
Comments: 18 (11 by maintainers)

Top GitHub Comments

apalala commented, Apr 20, 2019

A good asynchronous solution is to use the spider_idle signal to schedule in batches:

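    from scrapy import signals
    from scrapy.exceptions import DontCloseSpider

    # Methods of a scrapy.Spider subclass; batch() and done() are
    # user-defined helpers that are not shown in this comment.
    @classmethod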
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signals.spider_idle)
        return spider

    def start_requests(self):
        yield from self.batch()

    def spider_idle(self):
        if self.done():
            return

        for req in self.batch():
            self.crawler.engine.schedule(req, self)

        raise DontCloseSpider
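
To make the control flow of that snippet easier to follow, here is a pure-Python simulation of it. This is only a sketch under stated assumptions: `run()` is a toy loop, not Scrapy's engine, `DontClose` stands in for `DontCloseSpider`, and `BatchSpider`, `pending`, and `batch_size` are hypothetical names that fill in the `batch()` and `done()` helpers the comment leaves out.

```python
from collections import deque


class DontClose(Exception):
    """Stand-in for scrapy.exceptions.DontCloseSpider in this simulation."""


class BatchSpider:
    """Hypothetical spider that hands out work in fixed-size batches."""

    def __init__(self, words, batch_size=2):
        self.pending = deque(words)
        self.batch_size = batch_size

    def done(self):
        return not self.pending

    def batch(self):
        # Yield at most batch_size items, then stop (no `while True`).
        for _ in range(min(self.batch_size, len(self.pending))):
            yield self.pending.popleft()

    def start_requests(self):
        yield from self.batch()

    def spider_idle(self, queue):
        if self.done():
            return  # nothing left, so the loop is allowed to close
        queue.extend(self.batch())
        raise DontClose


def run(spider):
    """Toy engine loop: drain the queue, ask the spider for more when idle."""
    queue = deque(spider.start_requests())
    processed, idle_calls = [], 0
    while True:
        while queue:
            processed.append(queue.popleft())  # pretend to download
        idle_calls += 1
        try:
            spider.spider_idle(queue)
        except DontClose:
            continue  # more work was scheduled, keep running
        break  # idle and nothing scheduled: close
    return processed, idle_calls


processed, idle_calls = run(BatchSpider(["a", "b", "c", "d", "e"]))
print(processed, idle_calls)
```

The point of the design is that `start_requests` always terminates; new work is only pulled in when the loop has nothing left to do, so a blocking or endlessly repeating generator never sits in the hot path.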

kingname commented, May 15, 2018

When you yield a request, Scrapy puts the request object into a scheduling pool, and the scheduler processes the pooled requests concurrently once there are enough request objects or a short amount of time has passed.
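
Independent of Scrapy's internals, it may help to see how a `while True` generator behaves when a consumer pulls from it. The sketch below is plain Python with hypothetical names (`endless_requests`, `read_words`); it assumes a source that keeps returning the same single word, which is one possible situation and not a confirmed diagnosis of the issue above.

```python
import itertools


def endless_requests(read_words):
    """Generator in the style of the question: poll a source forever."""
    while True:
        for word in read_words():
            yield f"request:{word}"


polls = 0


def read_words():
    # Hypothetical source that keeps returning the same single word,
    # e.g. because nothing marks it as processed between polls.
    global polls
    polls += 1
    return ["only-word"]


gen = endless_requests(read_words)
polls_before_pull = polls  # creating the generator runs none of its body

# The consumer decides how many items to pull; here it takes three.
first_three = list(itertools.islice(gen, 3))
print(first_three, polls)
```

Each pull re-polls the source and gets the same item back, so a consumer that keeps pulling never sees the generator finish. If the source only changes after a callback has run, the loop depends on the consumer doing other work between pulls.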
