Trouble when adding a new spider
Hi, thanks a lot for this project, it is great. Nevertheless, I am having a lot of trouble getting my head around this complex stack.
I can run all the examples from the quick start, but I want to do more, and I want to learn more about the internals of scrapy-cluster.
First of all, I would like to add my own spider. I will start with a very basic one:
import urlparse

from scrapy import Request
from crawling.spiders.redis_spider import RedisSpider


class BasicSpider(RedisSpider):
    name = "manual"

    def __init__(self, *args, **kwargs):
        super(BasicSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        # Get the next index URLs and yield Requests
        print response
        next_selector = response.xpath('//*[contains(@class,"next")]//@href')
        for url in next_selector.extract():
            yield Request(urlparse.urljoin(response.url, url))

        # Get item URLs and yield Requests
        item_selector = response.xpath('//*[@itemprop="url"]/@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin(response.url, url),
                          callback=self.parse_item)

    def parse_item(self, response):
        ....
When I run the command (I am using a local website running on localhost for testing purposes):
(sc) vagrant@scdev:/vagrant/kafka-monitor$ python kafka_monitor.py feed '{"url": "http://localhost:9312/properties/index_00000.html", "appid":"testapp1", "crawlid":"abc123", "spiderid":"manual"}'
2016-09-18 05:58:00,448 [kafka-monitor] DEBUG: Logging to stdout
2016-09-18 05:58:00,450 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
"url": "http://localhost:9312/properties/index_00000.html",
"spiderid": "manual",
"crawlid": "abc123",
"appid": "testapp1"
}
No handlers could be found for logger "kafka.producer"
2016-09-18 05:58:00,453 [kafka-monitor] INFO: Successfully fed item to Kafka
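For context, the payload fed above can be reconstructed in plain Python. These are the four keys used in the command (which keys are strictly required is defined by the kafka-monitor's scraper schema, so treat this as a sketch of the payload rather than the full validation rules):

```python
import json

# The feed payload from the command above; spiderid must match the
# spider's name attribute ("manual") for the crawl to reach it.
payload = {
    "url": "http://localhost:9312/properties/index_00000.html",
    "appid": "testapp1",
    "crawlid": "abc123",
    "spiderid": "manual",
}
print(json.dumps(payload))
```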
The kafka-monitor reports:
2016-09-18 06:05:42,667 [kafka-monitor] DEBUG: Incremented total stats
2016-09-18 06:05:42,669 [kafka-monitor] DEBUG: Incremented plugin 'ScraperHandler' plugin stats
2016-09-18 06:05:42,670 [kafka-monitor] INFO: Added crawl to Redis
and then the crawler run fails with this error:
(sc) vagrant@scdev:/vagrant/crawler$ scrapy runspider crawling/spiders/manual.py
2016-09-18 05:58:06,864 [sc-crawler] INFO: Changed Public IP: None -> 121.44.110.101
<200 http://localhost:9312/properties/index_00000.html>
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object iter_errback at 0x7f2c2865c460> ignored
Unhandled error in Deferred:
Traceback (most recent call last):
File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1194, in run
self.mainLoop()
File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1203, in mainLoop
self.runUntilCurrent()
File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/task.py", line 645, in _tick
taskObj._oneWorkUnit()
--- <exception caught here> ---
File "/home/vagrant/sc/local/lib/python2.7/site-packages/twisted/internet/task.py", line 491, in _oneWorkUnit
result = next(self._iterator)
File "/home/vagrant/sc/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 63, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
File "/home/vagrant/sc/local/lib/python2.7/site-packages/scrapy/core/scraper.py", line 183, in _process_spidermw_output
self.crawler.engine.crawl(request=output, spider=spider)
File "/home/vagrant/sc/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 183, in crawl
self.schedule(request, spider)
File "/home/vagrant/sc/local/lib/python2.7/site-packages/scrapy/core/engine.py", line 189, in schedule
if not self.slot.scheduler.enqueue_request(request):
File "/vagrant/crawler/crawling/distributed_scheduler.py", line 361, in enqueue_request
if not request.dont_filter and self.dupefilter.request_seen(request):
File "/vagrant/crawler/crawling/redis_dupefilter.py", line 24, in request_seen
c_id = request.meta['crawlid']
exceptions.KeyError: 'crawlid'
Can someone help with that error?
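The traceback points at the likely cause: crawling/redis_dupefilter.py reads request.meta['crawlid'], but the Requests yielded from parse() above are created without any meta, so the crawlid attached to the seed request by the kafka-monitor never reaches them. A plain-Python sketch of the fix, copying the bookkeeping keys forward (the helper name is made up for illustration):

```python
def propagate_cluster_meta(parent_meta, child_meta=None):
    """Copy scrapy-cluster bookkeeping keys from a parent request's
    meta into a child request's meta, so the dupefilter can find them."""
    child = dict(child_meta or {})
    for key in ('crawlid', 'appid', 'spiderid'):
        if key in parent_meta and key not in child:
            child[key] = parent_meta[key]
    return child

seed = {'crawlid': 'abc123', 'appid': 'testapp1', 'spiderid': 'manual'}
child = propagate_cluster_meta(seed)

# In the spider this would become:
#   yield Request(urlparse.urljoin(response.url, url),
#                 meta=propagate_cluster_meta(response.meta))
```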
Issue Analytics
- State:
- Created 7 years ago
- Comments: 6
Closing, as this should be addressed with the two new middlewares.
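For readers hitting the same KeyError on older versions: the middlewares mentioned presumably re-attach the bookkeeping meta automatically. A minimal, duck-typed sketch of that idea (class and key names are assumptions, not the actual scrapy-cluster implementation):

```python
class MetaPassthroughMiddleware(object):
    """Spider-middleware sketch: copy crawl bookkeeping keys from the
    response onto every request the spider yields, so bare Requests
    still reach the dupefilter with a crawlid attached."""

    KEYS = ('crawlid', 'appid', 'spiderid')

    def process_spider_output(self, response, result, spider):
        for output in result:
            meta = getattr(output, 'meta', None)  # Requests have .meta, Items do not
            if meta is not None:
                for key in self.KEYS:
                    if key in response.meta:
                        meta.setdefault(key, response.meta[key])
            yield output


# Duck-typed stand-ins for scrapy's Request/Response, just to exercise the sketch:
class FakeRequest(object):
    def __init__(self):
        self.meta = {}

class FakeResponse(object):
    meta = {'crawlid': 'abc123', 'appid': 'testapp1', 'spiderid': 'manual'}

mw = MetaPassthroughMiddleware()
fixed = list(mw.process_spider_output(FakeResponse(), [FakeRequest()], spider=None))
```

With something like this enabled in SPIDER_MIDDLEWARES, spiders can yield plain Requests without hand-copying response.meta on every yield.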
Hi @shenbakeshkishore,
You can view the documentation around the logger here.
On first glance, it appears like you need to add the log level to your command, like self._logger.info(<content>), but otherwise please raise a different issue on this project if your problem persists and you think it is a problem with the code in this repo.