Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[bug?] while True in start_requests(self) makes Scrapy unable to consume the yields

See original GitHub issue

I’m doing

    def start_requests(self):
        while True:
            words = read_a_list_wanna_crawl()   # the question's own helper
            ips = get_a_ip_list()               # the question's own helper
            if words.count() > 0:
                for word, ip in zip(words, ips):
                    print('do while')
                    # processed_url is built from `word` (elided in the question)
                    yield scrapy.Request(processed_url, self.html_parse, meta={'proxy': ip, ...})

But when zip(words, ips) yields only one pair, Scrapy prints 'do while' forever (an infinite loop) and never downloads any requests; when it yields more than one pair, Scrapy does not go into the infinite loop.

Is this a bug? Can Scrapy handle this?

PS (another way to solve this): is it possible to create a fake scrapy.Request() that doesn't perform an actual request but still runs its callback, to finish this kind of control flow in Scrapy?
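
One possible workaround for the "fake request" idea (not suggested in the thread): Scrapy's default download handlers include one for data: URIs, so a Request to a data: URL fires its callback without any network round trip. A minimal sketch, with the spider name and callback name made up:

    import scrapy

    class FakeRequestSpider(scrapy.Spider):
        # Hypothetical spider used only to illustrate the data: URI trick.
        name = "fake_request_demo"

        def start_requests(self):
            # The data: URI is resolved locally by Scrapy's data-URI download
            # handler, so no real download happens, but the callback still runs.
            yield scrapy.Request("data:text/plain,placeholder",
                                 callback=self.after_fake, dont_filter=True)

        def after_fake(self, response):
            self.logger.info("callback ran without a network request: %s", response.text)
            # Continue the control flow here, e.g. yield the real requests.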

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 18 (11 by maintainers)

Top GitHub Comments

1 reaction
apalala commented, Apr 20, 2019

A good asynchronous solution is to use the spider_idle signal to schedule in batches:

    # Module-level imports assumed by this snippet:
    #     from scrapy import signals
    #     from scrapy.exceptions import DontCloseSpider
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Call spider.spider_idle() whenever the engine runs out of requests.
        crawler.signals.connect(spider.spider_idle, signals.spider_idle)
        return spider

    def start_requests(self):
        yield from self.batch()

    def spider_idle(self):
        if self.done():
            return

        # batch() and done() are the spider's own methods (not shown here).
        for req in self.batch():
            self.crawler.engine.schedule(req, self)

        # Keep the spider open so the newly scheduled batch is processed.
        raise DontCloseSpider
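
The snippet above leaves batch() and done() to the spider. A hedged sketch of what they might look like for the original word/IP use case follows; read_a_list_wanna_crawl() and get_a_ip_list() are the question's own helpers, and build_url() is a hypothetical function that turns a word into a crawlable URL:

    def batch(self):
        # Pull the current work items; these helpers come from the question and
        # their exact return types are assumed here.
        words = read_a_list_wanna_crawl()
        ips = get_a_ip_list()
        for word, ip in zip(words, ips):
            url = build_url(word)  # hypothetical: build the request URL from the word
            yield scrapy.Request(url, callback=self.html_parse, meta={'proxy': ip})

    def done(self):
        # Assumed stop criterion: nothing left to crawl.
        return read_a_list_wanna_crawl().count() == 0
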
1 reaction
kingname commented, May 15, 2018

When you yield a request, Scrapy puts the request object into a scheduling queue, and the scheduler sends the queued requests concurrently once there are enough of them or a short time interval has elapsed.
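
To see this lazy, batched consumption in action, here is a minimal sketch (not from the thread; the spider name and example.com URLs are made up, and default settings are assumed) that logs when each request is pulled from start_requests() and when its response comes back:

    import scrapy

    class LazyPullSpider(scrapy.Spider):
        # Hypothetical demo spider, not part of the original issue.
        name = "lazy_pull_demo"

        def start_requests(self):
            for i in range(5):
                # Each yield hands one Request to the engine; the scheduler queues
                # it and downloads run concurrently, subject to the concurrency settings.
                self.logger.info("yielding request %d", i)
                yield scrapy.Request(f"https://example.com/?page={i}",
                                     callback=self.parse, dont_filter=True)

        def parse(self, response):
            self.logger.info("downloaded %s", response.url)

Running this with scrapy runspider should show the "yielding" log lines interleaved with downloads rather than all emitted up front.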

Read more comments on GitHub >

Top Results From Across the Web

Scrapy: Can't restart start_requests() properly
I want first to process the .js file, extract the coordinates, and then parse the main page and start crawling its links/parsing its...
Read more >
Requests and Responses — Scrapy 2.7.1 documentation
a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP ......
Read more >
Scrapy - Requests and Responses
Scrapy - Requests and Responses, Scrapy can crawl websites using the Request and ... def start_requests(self): for u in self.start_urls: yield scrapy.
Read more >
Scrapy: This is how to successfully login with ease
To use it in our scrapy spider we have to import it first. from scrapy.http import FormRequest. Now instead of using start_url at...
Read more >
Web Scraping With Selenium & Scrapy | by Karthikeyan P
The main drawback of Scrapy is its inability to natively handle ... of combining Selenium with Scrapy and makes use of Scrapy's Selector...
Read more >
