Make requests via message queues
I’m trying to pass requests to the spider externally, via message queues, and keep it running forever.
I found some projects made by others but none of them work for the current version of scrapy, so I’m trying to fix the issues or just find a way to do it myself.
So far, I found that others got a reference to the scheduler from the spider in the middleware, like this:
class MyMiddleware(object):
    # [...]
    def ensure_init(self, spider):
        self.spider = spider
        self.scheduler = spider.crawler.engine.slot.scheduler

    # [...]
    def process_response(self, request, response, spider):
        self.ensure_init(spider)
        return response
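(For completeness, the middleware is enabled in settings.py through DOWNLOADER_MIDDLEWARES; the dotted path and priority below are just placeholders for wherever MyMiddleware actually lives:)
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyMiddleware': 543,
}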
Then, in another custom "Scheduler" class:
class MyScheduler(object):
    # [...]
    def open(self, spider):
        self.spider = spider

    # [...]
    def next_request(self):
        # `page` comes from the message queue (elided here)
        return self.spider.make_request(page)
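The custom scheduler itself is wired in through the SCHEDULER setting, and as far as I can tell the engine expects it to implement roughly the interface below (the method names follow the default scrapy.core.scheduler.Scheduler; the in-memory deque is only a stand-in for the real message queue):
# settings.py: SCHEDULER = 'myproject.scheduler.MyScheduler'
from collections import deque


class MyScheduler(object):

    def __init__(self):
        # Stand-in for the message queue backend
        self.queue = deque()

    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    def open(self, spider):
        self.spider = spider

    def close(self, reason):
        pass

    def has_pending_requests(self):
        return len(self.queue) > 0

    def enqueue_request(self, request):
        self.queue.append(request)
        return True

    def next_request(self):
        # Expected to return a Request object (or None), not a generator
        return self.queue.popleft() if self.queue else None

    def __len__(self):
        return len(self.queue)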
And then, in the spider:
class TestSpider(scrapy.Spider):
    # [...]
    def make_request(self, page):
        # This works, i.e. it prints "Making request"
        logging.info("Making request")
        yield scrapy.Request(url=page, callback=self.parse)

    # [...]
    def parse(self, response):
        # This never gets printed
        logging.info("Got response")
This is the simplest example I could put together, but you get the idea; the real code is much messier, which makes it harder to fix.
The issue is that, in theory, this should work. I don’t know exactly when the next_request method is called, but it clearly is called, because it reaches the make_request method in the spider. The only problem is that it never gets to parse, or any other callback, in the spider, and I don’t know why.
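One thing I noticed while poking at it: because make_request uses yield, it is a generator function, so calling it returns a generator object rather than a Request, and I suspect the scheduler silently drops that. A quick check outside of Scrapy:
import scrapy


def make_request(page):
    yield scrapy.Request(url=page)


result = make_request("http://example.com")
print(type(result))  # <class 'generator'>, not a Request
print(next(result))  # the Request only materializes once the generator is advanced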
I also tried connecting to the message queue directly in the spider, which I thought should work, but it doesn’t. For example:
import logging

import pika
import scrapy


class TestSpider(scrapy.Spider):
    rbmqrk = 'test'
    rmq = pika.BlockingConnection(
        pika.ConnectionParameters(host='localhost'))
    # Init channel
    rmqc = rmq.channel()

    def __init__(self, *args, **kwargs):
        super(TestSpider, self).__init__(*args, **kwargs)
        self.rmqc.queue_declare(queue=self.rbmqrk)

    def start_requests(self):
        self.rmqc.basic_consume(self.callback, self.rbmqrk)

    def callback(self, channel, method_frame, header_frame, body):
        # This gets printed! It works up to here.
        logging.info(body)
Up to there, everything works fine: the body of the message received from the queue gets printed.
But if we try to make or yield a request from the callback method in the spider, it doesn’t work. For example:
class TestSpider(scrapy.Spider):
    # [...] same as above...

    def start_requests(self):
        # Same as above
        self.rmqc.basic_consume(self.callback, self.rbmqrk)

    def callback(self, channel, method_frame, header_frame, body):
        # This DOESN'T get printed
        logging.info(body)
        yield scrapy.Request(url=body.decode(), callback=self.parse)
        # This wouldn't work either (already tried it as well):
        # return scrapy.Request(url=body.decode(), callback=self.parse)

    def parse(self, response):
        # We never get here; this doesn't get printed either
        logging.info("Got response")
Unfortunately, neither yielding nor returning a request from the callback results in the spider actually making that request.
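My guess is that nothing ever iterates that generator: pika calls callback itself and discards the return value, so the yield never even runs and Scrapy never sees the Request. The pattern I’ve seen suggested elsewhere is to hook the spider_idle signal, poll the queue with basic_get, and hand requests to the engine directly, roughly like this (untested sketch; the spider name, queue name, and host are placeholders, and engine.crawl’s signature differs between Scrapy versions):
import logging

import pika
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class QueueSpider(scrapy.Spider):
    name = 'queue_test'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(QueueSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Poll the queue every time the spider would otherwise go idle
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, *args, **kwargs):
        super(QueueSpider, self).__init__(*args, **kwargs)
        self.queue_name = 'test'
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host='localhost'))
        self.channel = connection.channel()
        self.channel.queue_declare(queue=self.queue_name)

    def start_requests(self):
        # No initial requests; everything comes from the queue
        return []

    def spider_idle(self):
        # Non-blocking fetch: returns (None, None, None) when the queue is empty
        # (acknowledgement handling omitted for brevity)
        method_frame, header_frame, body = self.channel.basic_get(self.queue_name)
        if body is not None:
            request = scrapy.Request(url=body.decode(), callback=self.parse)
            # engine.crawl() schedules the request; newer Scrapy versions
            # drop the spider argument
            self.crawler.engine.crawl(request, self)
        # Keep the spider alive so spider_idle fires again
        raise DontCloseSpider

    def parse(self, response):
        logging.info("Got response from %s", response.url)
I haven’t had a chance to verify that approach end to end, though.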
As you can see, I’ve tried several things without luck. All I need is to be able to make requests from the messages in the message queue. I’m not sure whether there’s a bug in scrapy or something I can fix in my own code, but I would love some input on this before I start digging deeper into the scrapy code myself.
Top GitHub Comments
Hi @octohedron, I’m wondering if Scrapy RT may be useful for your task. Sadly, it hasn’t been actively maintained recently.
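It exposes running spiders over a small HTTP API, so external code can push URLs in; roughly like this (the port and parameters are the defaults as I remember them from the ScrapyRT docs, so please double-check):
import requests

# ScrapyRT is started separately (the `scrapyrt` command) inside the
# Scrapy project and listens on port 9080 by default
response = requests.get(
    'http://localhost:9080/crawl.json',
    params={'spider_name': 'test', 'url': 'http://example.com'},
)
print(response.json().get('items'))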
@octohedron After staying up 24 hours straight trying to work this out, I decided to go with a Redis queue. See scrapy-redis. It was fairly easy to set up, and it uses Redis for deduplication and as the scheduler (queue), with persistence.
I set it up on a Google Cloud Compute Engine instance and set up redis-server as a daemon.
Then I started the spider with nohup (which prevents logging out of the SSH session from killing the spider).
Once the spider is running, just add seed URLs to Redis and it should start feeding them to the crawl instance you started with nohup. Here’s how I did it.
It’s been running for over 24 hours now, I have tested the persistence bit, and it all seems to be working as expected. I still need to find out what happens when the queue is finished and you try adding more, and how to set up parallel runs, but I will be digging into that next. Hope this helps.
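In case it helps, pushing a seed URL with scrapy-redis basically means LPUSHing it onto the spider’s start-URLs key (by default <spider name>:start_urls, configurable via REDIS_START_URLS_KEY); a minimal example with redis-py, with placeholder names:
import redis

r = redis.Redis(host='localhost', port=6379)
# scrapy-redis pops seed URLs from "<spider name>:start_urls" by default
r.lpush('test:start_urls', 'http://example.com')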