
Make requests via message queues


I’m trying to feed requests to the spider externally via message queues, and to keep the spider running forever.

I found some projects by others, but none of them works with the current version of Scrapy, so I’m trying either to fix their issues or to find a way to do it myself.

So far, I’ve found that others obtain a reference to the scheduler from the spider inside a middleware, like this:

class MyMiddleware(object):
    # [...]
    def ensure_init(self, spider):
        self.spider = spider
        self.scheduler = spider.crawler.engine.slot.scheduler
    # [...]
    def process_response(self, request, response, spider):
        self.ensure_init(spider)
        return response

Then, in a custom scheduler class:

class MyScheduler(object):
    # [...]
    def open(self, spider):
        self.spider = spider
    # [...]
    def next_request(self):
        return self.spider.make_request(page)

And then, in the spider:

class TestSpider(scrapy.Spider):
    # [...]
    def make_request(self, page):
        # This works, i.e. prints "Making request"
        logging.info("Making request")
        yield scrapy.Request(url=page, callback=self.parse)
    # [...]
    def parse(self, response):
        # This never gets printed
        logging.info("Got response")

This is the simplest example I could put together, but you get the idea; the real code is very messy, which makes it harder to fix.

The issue is that in theory this should work. I don’t know exactly when the next_request method is called, but it clearly is, because it ends up calling the make_request method in the spider. The only problem is that execution never reaches parse (the callback) in the spider, and I don’t know why.
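One detail worth noting as an aside (my observation, not part of the original report): because make_request contains a `yield`, calling it returns a generator object rather than a `scrapy.Request`, so next_request hands the engine something it cannot schedule. A minimal, Scrapy-free sketch of the difference, using a stand-in Request class:

```python
class Request:
    """Stand-in for scrapy.Request, just to keep the sketch self-contained."""
    def __init__(self, url):
        self.url = url

def make_request_gen(page):
    # Contains `yield`, so calling it only builds a generator object;
    # the body does not run until someone iterates the generator.
    yield Request(page)

def make_request_plain(page):
    # Returns the Request itself, which is what a scheduler's
    # next_request is expected to hand back.
    return Request(page)

gen_result = make_request_gen("http://example.com")
plain_result = make_request_plain("http://example.com")

print(type(gen_result).__name__)    # generator
print(type(plain_result).__name__)  # Request
```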

I also tried connecting to the message queue directly inside the spider, which I thought should work, but it doesn’t. For example:

import pika

class TestSpider(scrapy.Spider):
    rbmqrk = 'test'
    rmq = pika.BlockingConnection(
        pika.ConnectionParameters(host='localhost'))
    # Init channel
    rmqc = rmq.channel()

    def __init__(self, *args, **kwargs):
        super(TestSpider, self).__init__(*args, **kwargs)
        self.rmqc.queue_declare(queue=self.rbmqrk)

    def start_requests(self):
        self.rmqc.basic_consume(self.callback, self.rbmqrk)

    def callback(self, channel, method_frame, header_frame, body):
        # This gets printed! It works up to here.
        logging.info(body)

Up to this point everything works fine: the body of each message received from the queue gets logged.

But if I try to yield or return a request from the callback method in the spider, it doesn’t work. For example:

class TestSpider(scrapy.Spider):
    # [...] same as above...

    def start_requests(self):
        # Same as above
        self.rmqc.basic_consume(self.callback, self.rbmqrk)

    def callback(self, channel, method_frame, header_frame, body):
        # This DOESN'T get printed
        logging.info(body)
        yield scrapy.Request(url=body.decode(), callback=self.parse)
        # This wouldn't work either (already tried as well)
        # return scrapy.Request(url=body.decode(), callback=self.parse)

    def parse(self, response):
        # We never get here, this doesn't get printed either
        logging.info("Got response")

Unfortunately, neither yielding a generator nor returning a request from the callback results in the spider actually making a request.
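For what it’s worth, this behaviour follows from how message callbacks are dispatched: the consumer simply calls the callback and discards its return value, and because the callback contains a `yield`, calling it doesn’t even execute its body — it just builds a generator that nobody iterates. A Scrapy- and pika-free sketch of both effects, using a hypothetical dispatch helper as a stand-in for message delivery:

```python
made_requests = []

def make_request(url):
    # Stand-in for yielding a scrapy.Request into the engine.
    made_requests.append(url)

def callback_with_yield(body):
    # Because of `yield`, calling this only creates a generator;
    # the body (including make_request) never executes.
    make_request(body)
    yield body

def dispatch(callback, message):
    # Stand-in for a queue consumer delivering a message: it calls the
    # callback and throws the return value away.
    callback(message)

dispatch(callback_with_yield, "http://example.com")
print(made_requests)  # [] -- the generator body never ran
```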

As you can see, I’ve tried several things without luck. All I need is to be able to make requests from the messages in the queue. I’m not sure whether there’s a bug in Scrapy or something I can fix in my code, but I’d love some input on this before I start digging deeper into the Scrapy code myself.

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

1 reaction
OlgaCh commented, Nov 16, 2018

Hi @octohedron, wondering if Scrapy RT may be useful for your task. Sadly, it hasn’t been actively maintained recently.

1 reaction
nicksherron commented, Nov 6, 2018

@octohedron After staying up 24 hrs straight trying to work this out, I decided to go with a Redis queue. See scrapy-redis. It was fairly easy to set up, and it uses Redis as the dedupe filter and scheduler (queue), with persistence.
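As a rough illustration of the setup described above — a sketch based on the scrapy-redis README, not the commenter’s actual configuration:

```python
# settings.py -- enable scrapy-redis as the scheduler and dupefilter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True          # keep the queue across runs (persistence)
REDIS_URL = "redis://localhost:6379"

# myspider.py -- a spider that reads its start URLs from Redis
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "myspider"
    # URLs pushed to this Redis key become requests
    redis_key = "myspider:start_urls"

    def parse(self, response):
        yield {"url": response.url}
```

With this in place the spider idles when the key is empty and starts crawling as soon as URLs are pushed, which is exactly the "feed requests externally" behaviour asked about in the question.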

I set it up on a gcloud compute engine instance and set up redis-server as a daemon:

$ redis-server --daemonize yes

Then I started the spider with nohup (this prevents logging out of the ssh session from killing the spider):

$ nohup scrapy crawl myspider -o out.jl

Once the spider is running, just add seed URLs to Redis and it will start feeding the crawl instance you started with nohup. Here’s how I did it:

$ cat urlseed.txt | awk '{print "LPUSH myspider:start_urls", $0}' | redis-cli --pipe

It’s been running for over 24 hrs now, and I have tested the persistence bit and it all seems to be working as expected. I still need to find out what happens when the queue is finished and you try adding more, and how to set up parallel runs, but I will be digging into that next. Hope this helps.

