
Trouble when adding a new spider


Hi, thanks a lot for this project, it is great. Nevertheless, I'm having a lot of trouble getting my head around this complex stack.

I can run all the examples from the quick start, but I want to do more and learn more about the internals of scrapy-cluster.

First of all, I would like to add my own spider. I'll start with a very basic one (the imports shown below are added for completeness; the RedisSpider import path is an assumption based on the scrapy-cluster layout):

import urlparse  # Python 2 stdlib, matching the Python 2.7 traceback below

from scrapy import Request
# Import path assumed from the scrapy-cluster crawler layout:
from crawling.spiders.redis_spider import RedisSpider


class BasicSpider(RedisSpider):
    name = "manual"

    def __init__(self, *args, **kwargs):
        super(BasicSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        # Get the next index URLs and yield Requests
        print response
        next_selector = response.xpath('//*[contains(@class,"next")]//@href')
        for url in next_selector.extract():
            yield Request(urlparse.urljoin(response.url, url))

        # Get item URLs and yield Requests
        item_selector = response.xpath('//*[@itemprop="url"]/@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin(response.url, url),
                          callback=self.parse_item)

    def parse_item(self, response):
        # Item extraction was elided ("....") in the original post.
        pass

When I run the command (I am using a local website running on localhost for test purposes):

(sc) vagrant@scdev:/vagrant/kafka-monitor$ python kafka_monitor.py feed '{"url": "http://localhost:9312/properties/index_00000.html", "appid":"testapp1", "crawlid":"abc123", "spiderid":"manual"}'
2016-09-18 05:58:00,448 [kafka-monitor] DEBUG: Logging to stdout
2016-09-18 05:58:00,450 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "url": "http://localhost:9312/properties/index_00000.html",
    "spiderid": "manual",
    "crawlid": "abc123",
    "appid": "testapp1"
}
No handlers could be found for logger "kafka.producer"
2016-09-18 05:58:00,453 [kafka-monitor] INFO: Successfully fed item to Kafka
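
For reference, the feed command simply publishes that JSON onto the demo.incoming Kafka topic, as the "Feeding JSON into demo.incoming" log line shows. Below is a minimal sketch of producing the same message directly with kafka-python; the broker address is an assumption about this local Vagrant setup:

import json
from kafka import KafkaProducer

# Serialize dicts to UTF-8 JSON, matching what kafka_monitor.py feeds.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',  # assumed broker address for the VM
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))

producer.send('demo.incoming', {
    "url": "http://localhost:9312/properties/index_00000.html",
    "spiderid": "manual",
    "crawlid": "abc123",
    "appid": "testapp1",
})
producer.flush()  # block until the message is actually delivered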

The kafka-monitor then reports:

2016-09-18 06:05:42,667 [kafka-monitor] DEBUG: Incremented total stats
2016-09-18 06:05:42,669 [kafka-monitor] DEBUG: Incremented plugin 'ScraperHandler' plugin stats
2016-09-18 06:05:42,670 [kafka-monitor] INFO: Added crawl to Redis

but when I run the spider, I get this error:

(sc) vagrant@scdev:/vagrant/crawler$ scrapy runspider crawling/spiders/manual.py
2016-09-18 05:58:06,864 [sc-crawler] INFO: Changed Public IP: None -> 121.44.110.101
<200 http://localhost:9312/properties/index_00000.html>
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object iter_errback at 0x7f2c2865c460> ignored
Unhandled error in Deferred:


Traceback (most recent call last):
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.mainLoop()
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1203, in mainLoop
    self.runUntilCurrent()
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/task.py", line 645, in _tick
    taskObj._oneWorkUnit()
--- <exception caught here> ---
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/task.py", line 491, in _oneWorkUnit
    result = next(self._iterator)
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 63, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/scrapy/core/scraper.py", line 183, in _process_spidermw_output
    self.crawler.engine.crawl(request=output, spider=spider)
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 183, in crawl
    self.schedule(request, spider)
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 189, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "/vagrant/crawler/crawling/distributed_scheduler.py", line 361, in enqueue_request
    if not request.dont_filter and self.dupefilter.request_seen(request):
  File "/vagrant/crawler/crawling/redis_dupefilter.py", line 24, in request_seen
    c_id = request.meta['crawlid']
exceptions.KeyError: 'crawlid'

Can someone help with that error?
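
The traceback pinpoints the failure: redis_dupefilter.py reads request.meta['crawlid'], but the spider above yields plain Requests that carry none of the scrapy-cluster bookkeeping meta. A minimal workaround sketch (not the project's official fix; the maintainer's closing comment below points to new middlewares instead) is to copy the incoming response's meta onto each outgoing request so 'crawlid' survives:

# Workaround sketch: forward scrapy-cluster's bookkeeping meta by hand.
def parse(self, response):
    next_selector = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_selector.extract():
        req = Request(urlparse.urljoin(response.url, url))
        # Copy 'crawlid', 'appid', etc. from the parent response so that
        # distributed_scheduler/redis_dupefilter can find them.
        req.meta.update(response.meta)
        yield req

A more surgical version would copy only the cluster keys ('crawlid', 'appid', 'spiderid') rather than the whole meta dict, which also contains Scrapy-internal keys such as 'depth'.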

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
madisonb commented, Oct 10, 2016

Closing, as this should be addressed with the two new middlewares.
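
The comment does not say what the two middlewares actually do, so the following is only an illustrative sketch of the general idea, with hypothetical names: a Scrapy spider middleware that copies the cluster's meta fields from each response onto the requests a spider yields, so individual spiders no longer have to forward them by hand.

from scrapy import Request

class MetaPassthroughMiddleware(object):
    # Hypothetical: the bookkeeping keys scrapy-cluster tracks per crawl.
    CLUSTER_KEYS = ('crawlid', 'appid', 'spiderid')

    def process_spider_output(self, response, result, spider):
        for thing in result:
            if isinstance(thing, Request):
                for key in self.CLUSTER_KEYS:
                    # Fill in missing keys from the parent response's meta.
                    if key in response.meta and key not in thing.meta:
                        thing.meta[key] = response.meta[key]
            yield thing

Such a middleware would be enabled through the SPIDER_MIDDLEWARES setting in the crawler's settings.py.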

0 reactions
madisonb commented, Apr 30, 2018

Hi @shenbakeshkishore,

You can view the documentation around the logger here.

At first glance, it appears you need to add the log level to your call, like self._logger.info(<content>). Otherwise, please raise a separate issue on this project if your problem persists and you think it is a problem with the code in this repo.
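
For context, scrapy-cluster ships a scutils package whose LogFactory produces the self._logger used above. A minimal sketch of getting a logger and emitting a message at an explicit level; the keyword arguments shown are assumptions, so check the project documentation:

from scutils.log_factory import LogFactory

# Keyword arguments here are assumptions; see the scrapy-cluster docs.
logger = LogFactory.get_instance(json=False, stdout=True, level='INFO')
logger.info("successfully fed item to kafka")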


