[bug?] while True in start_requests(self) makes Scrapy unable to consume the yields
I'm doing:
def start_requests(self):
    while 1:
        words = read_a_list_wanna_crawl()
        ips = get_a_ip_list()
        if words.count() > 0:
            for word, ip in zip(words, ips):
                print('do while')
                yield scrapy.Request(processed_url, self.html_parse, meta={'proxy': ip, ...})
But when len(zip(words, ips)) == 1, Scrapy prints 'do while' forever (an infinite loop) and never downloads any requests; when len(zip(words, ips)) > 1, Scrapy does not go into the infinite loop.
Is this a bug? Can Scrapy handle this?
PS (another way to solve this): is it possible to create a "fake" scrapy.Request() that performs no download but still runs its callback, so this kind of control flow can be handled inside Scrapy?
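Regarding the PS: one workaround, not discussed in this thread and assuming a Scrapy version whose default download handlers include the data: scheme, is to yield a request to a data: URI. Such a request is resolved locally without touching the network, but it still passes through the scheduler and fires its callback. A minimal sketch:

import scrapy


class ControlFlowSpider(scrapy.Spider):
    name = 'control-flow'

    def start_requests(self):
        # Resolved by Scrapy's built-in data: URI download handler, so no
        # network download happens, but the callback still runs.
        yield scrapy.Request('data:,placeholder', callback=self.after_fake,
                             dont_filter=True)

    def after_fake(self, response):
        # Decide here which real requests to schedule next.
        self.logger.info('fake request finished: %s', response.text)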
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
A good asynchronous solution is to use the spider_idle signal to schedule requests in batches (a sketch follows below).
When you yield a request, Scrapy puts the Request object into the scheduler's queue, and the downloader then processes queued requests concurrently once enough requests have accumulated or a short interval has passed.
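A minimal sketch of that spider_idle pattern, reusing the read_a_list_wanna_crawl() / get_a_ip_list() helpers from the report (stubbed out here) plus a hypothetical build_url() placeholder; note that the exact engine.crawl() signature differs between Scrapy versions:

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


def read_a_list_wanna_crawl():
    # Stub for the helper in the original report.
    return []


def get_a_ip_list():
    # Stub for the helper in the original report.
    return []


def build_url(word):
    # Hypothetical placeholder for however processed_url was built.
    return f'https://example.com/?q={word}'


class BatchSpider(scrapy.Spider):
    name = 'batch'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Run our handler whenever the scheduler runs out of requests.
        crawler.signals.connect(spider.handle_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        # Seed only the first batch; later batches come from the idle handler.
        yield from self.next_batch()

    def next_batch(self):
        words = read_a_list_wanna_crawl()
        ips = get_a_ip_list()
        for word, ip in zip(words, ips):
            yield scrapy.Request(build_url(word), self.html_parse,
                                 meta={'proxy': ip}, dont_filter=True)

    def handle_idle(self, spider):
        # Called when nothing is left to crawl: push the next batch into the
        # scheduler and keep the spider alive instead of letting it close.
        scheduled = False
        for request in self.next_batch():
            # Newer Scrapy versions take only the request: engine.crawl(request)
            self.crawler.engine.crawl(request, spider)
            scheduled = True
        if scheduled:
            raise DontCloseSpider

    def html_parse(self, response):
        pass

The key differences from the original snippet are that start_requests() no longer loops forever, and DontCloseSpider keeps the crawl open for as long as new batches keep arriving.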