
Trouble when adding a new spider


Hi, thanks a lot for this project, it is great. Nevertheless, I'm having a lot of trouble getting my head around this complex stack.

I can run all the examples from the quick start, but I want to do more and learn more about the internals of scrapy-cluster.

First of all, I would like to add my own spider. I'll start with a very basic one (the imports shown below are added for completeness; the RedisSpider import path is an assumption based on the scrapy-cluster layout):

import urlparse  # Python 2 stdlib, matching the Python 2.7 traceback below

from scrapy import Request
# Import path assumed from the scrapy-cluster crawler layout:
from crawling.spiders.redis_spider import RedisSpider


class BasicSpider(RedisSpider):
    name = "manual"

    def __init__(self, *args, **kwargs):
        super(BasicSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        # Get the next index URLs and yield Requests
        print response
        next_selector = response.xpath('//*[contains(@class,"next")]//@href')
        for url in next_selector.extract():
            yield Request(urlparse.urljoin(response.url, url))

        # Get item URLs and yield Requests
        item_selector = response.xpath('//*[@itemprop="url"]/@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin(response.url, url),
                          callback=self.parse_item)

    def parse_item(self, response):
        # Item extraction was elided ("....") in the original post.
        pass

When I run the command (I am using a local website running on localhost for test purposes):

(sc) vagrant@scdev:/vagrant/kafka-monitor$ python kafka_monitor.py feed '{"url": "http://localhost:9312/properties/index_00000.html", "appid":"testapp1", "crawlid":"abc123", "spiderid":"manual"}'
2016-09-18 05:58:00,448 [kafka-monitor] DEBUG: Logging to stdout
2016-09-18 05:58:00,450 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
    "url": "http://localhost:9312/properties/index_00000.html",
    "spiderid": "manual",
    "crawlid": "abc123",
    "appid": "testapp1"
}
No handlers could be found for logger "kafka.producer"
2016-09-18 05:58:00,453 [kafka-monitor] INFO: Successfully fed item to Kafka
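
For reference, the feed command simply publishes that JSON onto the demo.incoming Kafka topic, as the "Feeding JSON into demo.incoming" log line shows. Below is a minimal sketch of producing the same message directly with kafka-python; the broker address is an assumption about this local Vagrant setup:

import json
from kafka import KafkaProducer

# Serialize dicts to UTF-8 JSON, matching what kafka_monitor.py feeds.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',  # assumed broker address for the VM
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))

producer.send('demo.incoming', {
    "url": "http://localhost:9312/properties/index_00000.html",
    "spiderid": "manual",
    "crawlid": "abc123",
    "appid": "testapp1",
})
producer.flush()  # block until the message is actually delivered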

The kafka-monitor then reports:

2016-09-18 06:05:42,667 [kafka-monitor] DEBUG: Incremented total stats
2016-09-18 06:05:42,669 [kafka-monitor] DEBUG: Incremented plugin 'ScraperHandler' plugin stats
2016-09-18 06:05:42,670 [kafka-monitor] INFO: Added crawl to Redis

but when I run the spider, I get this error:

(sc) vagrant@scdev:/vagrant/crawler$ scrapy runspider crawling/spiders/manual.py
2016-09-18 05:58:06,864 [sc-crawler] INFO: Changed Public IP: None -> 121.44.110.101
<200 http://localhost:9312/properties/index_00000.html>
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object iter_errback at 0x7f2c2865c460> ignored
Unhandled error in Deferred:


Traceback (most recent call last):
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.mainLoop()
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1203, in mainLoop
    self.runUntilCurrent()
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/task.py", line 645, in _tick
    taskObj._oneWorkUnit()
--- <exception caught here> ---
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/task.py", line 491, in _oneWorkUnit
    result = next(self._iterator)
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 63, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/scrapy/core/scraper.py", line 183, in _process_spidermw_output
    self.crawler.engine.crawl(request=output, spider=spider)
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 183, in crawl
    self.schedule(request, spider)
  File "/home/vagrant/sc/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 189, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "/vagrant/crawler/crawling/distributed_scheduler.py", line 361, in enqueue_request
    if not request.dont_filter and self.dupefilter.request_seen(request):
  File "/vagrant/crawler/crawling/redis_dupefilter.py", line 24, in request_seen
    c_id = request.meta['crawlid']
exceptions.KeyError: 'crawlid'

Can someone help with that error?
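
The traceback pinpoints the failure: redis_dupefilter.py reads request.meta['crawlid'], but the spider above yields plain Requests that carry none of the scrapy-cluster bookkeeping meta. A minimal workaround sketch (not the project's official fix; the maintainer's closing comment below points to new middlewares instead) is to copy the incoming response's meta onto each outgoing request so 'crawlid' survives:

# Workaround sketch: forward scrapy-cluster's bookkeeping meta by hand.
def parse(self, response):
    next_selector = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_selector.extract():
        req = Request(urlparse.urljoin(response.url, url))
        # Copy 'crawlid', 'appid', etc. from the parent response so that
        # distributed_scheduler/redis_dupefilter can find them.
        req.meta.update(response.meta)
        yield req

A more surgical version would copy only the cluster keys ('crawlid', 'appid', 'spiderid') rather than the whole meta dict, which also contains Scrapy-internal keys such as 'depth'.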

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
madisonb commented, Oct 10, 2016

Closing, as this should be addressed with the two new middlewares.
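
The comment does not say what the two middlewares actually do, so the following is only an illustrative sketch of the general idea, with hypothetical names: a Scrapy spider middleware that copies the cluster's meta fields from each response onto the requests a spider yields, so individual spiders no longer have to forward them by hand.

from scrapy import Request

class MetaPassthroughMiddleware(object):
    # Hypothetical: the bookkeeping keys scrapy-cluster tracks per crawl.
    CLUSTER_KEYS = ('crawlid', 'appid', 'spiderid')

    def process_spider_output(self, response, result, spider):
        for thing in result:
            if isinstance(thing, Request):
                for key in self.CLUSTER_KEYS:
                    # Fill in missing keys from the parent response's meta.
                    if key in response.meta and key not in thing.meta:
                        thing.meta[key] = response.meta[key]
            yield thing

Such a middleware would be enabled through the SPIDER_MIDDLEWARES setting in the crawler's settings.py.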

0 reactions
madisonb commented, Apr 30, 2018

Hi @shenbakeshkishore,

You can view the documentation around the logger here.

At first glance, it appears you need to add the log level to your call, like self._logger.info(<content>). Otherwise, please raise a separate issue on this project if your problem persists and you think it is a problem with the code in this repo.
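
For context, scrapy-cluster ships a scutils package whose LogFactory produces the self._logger used above. A minimal sketch of getting a logger and emitting a message at an explicit level; the keyword arguments shown are assumptions, so check the project documentation:

from scutils.log_factory import LogFactory

# Keyword arguments here are assumptions; see the scrapy-cluster docs.
logger = LogFactory.get_instance(json=False, stdout=True, level='INFO')
logger.info("successfully fed item to kafka")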


