Make requests via message queues
I’m trying to pass requests to the spider externally, via message queues, and keep it running forever.
I found some projects made by others but none of them work for the current version of scrapy, so I’m trying to fix the issues or just find a way to do it myself.
So far, I found that others got a reference to the scheduler from the spider in the middleware, like this:
class MyMiddleware(object):
    # [...]
    def ensure_init(self, spider):
        self.spider = spider
        self.scheduler = spider.crawler.engine.slot.scheduler

    # [...]
    def process_response(self, request, response, spider):
        self.ensure_init(spider)
        return response
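(For completeness, the middleware is enabled in settings.py through DOWNLOADER_MIDDLEWARES; the dotted path and priority below are just placeholders for wherever MyMiddleware actually lives:)
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyMiddleware': 543,
}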
Then, in another custom "Scheduler" class:
class MyScheduler(object):
    # [...]
    def open(self, spider):
        self.spider = spider

    # [...]
    def next_request(self):
        # `page` comes from the message queue (elided here)
        return self.spider.make_request(page)
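The custom scheduler itself is wired in through the SCHEDULER setting, and as far as I can tell the engine expects it to implement roughly the interface below (the method names follow the default scrapy.core.scheduler.Scheduler; the in-memory deque is only a stand-in for the real message queue):
# settings.py: SCHEDULER = 'myproject.scheduler.MyScheduler'
from collections import deque


class MyScheduler(object):

    def __init__(self):
        # Stand-in for the message queue backend
        self.queue = deque()

    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    def open(self, spider):
        self.spider = spider

    def close(self, reason):
        pass

    def has_pending_requests(self):
        return len(self.queue) > 0

    def enqueue_request(self, request):
        self.queue.append(request)
        return True

    def next_request(self):
        # Expected to return a Request object (or None), not a generator
        return self.queue.popleft() if self.queue else None

    def __len__(self):
        return len(self.queue)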
And then, in the spider:
class TestSpider(scrapy.Spider):
    # [...]
    def make_request(self, page):
        # This works, i.e. it prints "Making request"
        logging.info("Making request")
        yield scrapy.Request(url=page, callback=self.parse)

    # [...]
    def parse(self, response):
        # This never gets printed
        logging.info("Got response")
This is the simplest example I could put together, but you get the idea; the real code is much messier, which makes it harder to fix.
The issue is that, in theory, this should work. I don’t know exactly when the next_request method is called, but it clearly is called, because it reaches the make_request method in the spider. The only problem is that it never gets to parse, or any other callback, in the spider, and I don’t know why.
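One thing I noticed while poking at it: because make_request uses yield, it is a generator function, so calling it returns a generator object rather than a Request, and I suspect the scheduler silently drops that. A quick check outside of Scrapy:
import scrapy


def make_request(page):
    yield scrapy.Request(url=page)


result = make_request("http://example.com")
print(type(result))  # <class 'generator'>, not a Request
print(next(result))  # the Request only materializes once the generator is advanced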
I also tried connecting to the message queue directly in the spider, which I thought should work, but it doesn’t. For example:
import logging

import pika
import scrapy


class TestSpider(scrapy.Spider):
    rbmqrk = 'test'
    rmq = pika.BlockingConnection(
        pika.ConnectionParameters(host='localhost'))
    # Init channel
    rmqc = rmq.channel()

    def __init__(self, *args, **kwargs):
        super(TestSpider, self).__init__(*args, **kwargs)
        self.rmqc.queue_declare(queue=self.rbmqrk)

    def start_requests(self):
        self.rmqc.basic_consume(self.callback, self.rbmqrk)

    def callback(self, channel, method_frame, header_frame, body):
        # This gets printed! It works up to here.
        logging.info(body)
Up to there, everything works fine: the body of the message received from the queue gets printed.
But if we try to make or yield a request from the callback method in the spider, it doesn’t work. For example:
class TestSpider(scrapy.Spider):
    # [...] same as above...

    def start_requests(self):
        # Same as above
        self.rmqc.basic_consume(self.callback, self.rbmqrk)

    def callback(self, channel, method_frame, header_frame, body):
        # This DOESN'T get printed
        logging.info(body)
        yield scrapy.Request(url=body.decode(), callback=self.parse)
        # This wouldn't work either (already tried it as well):
        # return scrapy.Request(url=body.decode(), callback=self.parse)

    def parse(self, response):
        # We never get here; this doesn't get printed either
        logging.info("Got response")
Unfortunately, neither yielding nor returning a request from the callback results in the spider actually making that request.
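My guess is that nothing ever iterates that generator: pika calls callback itself and discards the return value, so the yield never even runs and Scrapy never sees the Request. The pattern I’ve seen suggested elsewhere is to hook the spider_idle signal, poll the queue with basic_get, and hand requests to the engine directly, roughly like this (untested sketch; the spider name, queue name, and host are placeholders, and engine.crawl’s signature differs between Scrapy versions):
import logging

import pika
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class QueueSpider(scrapy.Spider):
    name = 'queue_test'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(QueueSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Poll the queue every time the spider would otherwise go idle
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, *args, **kwargs):
        super(QueueSpider, self).__init__(*args, **kwargs)
        self.queue_name = 'test'
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host='localhost'))
        self.channel = connection.channel()
        self.channel.queue_declare(queue=self.queue_name)

    def start_requests(self):
        # No initial requests; everything comes from the queue
        return []

    def spider_idle(self):
        # Non-blocking fetch: returns (None, None, None) when the queue is empty
        # (acknowledgement handling omitted for brevity)
        method_frame, header_frame, body = self.channel.basic_get(self.queue_name)
        if body is not None:
            request = scrapy.Request(url=body.decode(), callback=self.parse)
            # engine.crawl() schedules the request; newer Scrapy versions
            # drop the spider argument
            self.crawler.engine.crawl(request, self)
        # Keep the spider alive so spider_idle fires again
        raise DontCloseSpider

    def parse(self, response):
        logging.info("Got response from %s", response.url)
I haven’t had a chance to verify that approach end to end, though.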
As you can see, I’ve tried several things without luck. All I need is to be able to make requests from the messages in the message queue. I’m not sure whether there’s a bug in scrapy or something I can fix in my own code, but I would love some input on this before I start digging deeper into the scrapy code myself.
Top GitHub Comments
Hi @octohedron, I’m wondering if Scrapy RT may be useful for your task. Sadly, it hasn’t been actively maintained recently.
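It exposes running spiders over a small HTTP API, so external code can push URLs in; roughly like this (the port and parameters are the defaults as I remember them from the ScrapyRT docs, so please double-check):
import requests

# ScrapyRT is started separately (the `scrapyrt` command) inside the
# Scrapy project and listens on port 9080 by default
response = requests.get(
    'http://localhost:9080/crawl.json',
    params={'spider_name': 'test', 'url': 'http://example.com'},
)
print(response.json().get('items'))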
@octohedron After staying up 24 hours straight trying to work this out, I decided to go with a Redis queue. See scrapy-redis. It was fairly easy to set up, and it uses Redis for deduplication and as the scheduler (queue), with persistence.
I set it up on a Google Cloud Compute Engine instance and set up redis-server as a daemon.
Then I started the spider with nohup (which prevents logging out of the SSH session from killing the spider).
Once the spider is running, just add seed URLs to Redis and it should start feeding them to the crawl instance you started with nohup. Here’s how I did it.
It’s been running for over 24 hours now, I have tested the persistence bit, and it all seems to be working as expected. I still need to find out what happens when the queue is finished and you try adding more, and how to set up parallel runs, but I will be digging into that next. Hope this helps.
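In case it helps, pushing a seed URL with scrapy-redis basically means LPUSHing it onto the spider’s start-URLs key (by default <spider name>:start_urls, configurable via REDIS_START_URLS_KEY); a minimal example with redis-py, with placeholder names:
import redis

r = redis.Redis(host='localhost', port=6379)
# scrapy-redis pops seed URLs from "<spider name>:start_urls" by default
r.lpush('test:start_urls', 'http://example.com')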