Docker Quickstart Error for Crawler container

Hi guys, Thanks for this amazing project!

I'm having trouble setting up a dockerized cluster. I am following the Docker quickstart step by step, but the online test fails in one container only: scrapycluster_crawler_1.

My local setup:

MacBook Pro - High Sierra
Docker4Mac Version 17.12.0-ce-mac47 (21805)

Full console output below:

root@e5bea50cbe71:/usr/src/app# ./run_docker_tests.sh
/usr/src/app/crawling/distributed_scheduler.py:8: ScrapyDeprecationWarning: Module `scrapy.conf` is deprecated, use `crawler.settings` attribute instead
  from scrapy.conf import settings
test_change_config (test_distributed_scheduler.TestDistributedSchedulerChangeConfig) ... ok
test_create_queues (test_distributed_scheduler.TestDistributedSchedulerCreateQueues) ... ok
test_enqueue_request (test_distributed_scheduler.TestDistributedSchedulerEnqueueRequest) ... /usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:306: SystemTimeWarning: System time is way off (before 2016-01-01). This will probably lead to SSL verification errors
  SystemTimeWarning
ok
test_error_config (test_distributed_scheduler.TestDistributedSchedulerErrorConfig) ... ok
test_expire_queues (test_distributed_scheduler.TestDistributedSchedulerExpireQueues) ... ok
test_find_item (test_distributed_scheduler.TestDistributedSchedulerFindItem) ... ok
test_fit_scale (test_distributed_scheduler.TestDistributedSchedulerFitScale) ... ok
test_load_domain_config (test_distributed_scheduler.TestDistributedSchedulerLoadDomainConfig) ... ok
test_next_request (test_distributed_scheduler.TestDistributedSchedulerNextRequest) ... ok
test_parse_cookie (test_distributed_scheduler.TestDistributedSchedulerParseCookie) ... ok
test_update_domain_queues (test_distributed_scheduler.TestDistributedSchedulerUpdateDomainQueues) ... ok
test_link_spider_parse (test_link_spider.TestLinkSpider) ... ok
/usr/src/app/crawling/log_retry_middleware.py:10: ScrapyDeprecationWarning: Importing from scrapy.xlib.tx is deprecated and will no longer be supported in future Scrapy versions. Update your code to import from twisted proper.
  from scrapy.xlib.tx import ResponseFailed
test_lrm_stats_setup (test_log_retry_middleware.TestLogRetryMiddlewareStats) ... ok
test_mpm_middleware (test_meta_passthrough_middleware.TestMetaPassthroughMiddleware) ... ok
test_process_item (test_pipelines.TestKafkaPipeline) ... ok
test_process_item (test_pipelines.TestLoggingBeforePipeline) ... ok
test_dupe_filter (test_redis_dupefilter.TestRedisDupefilter) ... ok
test_retries (test_redis_retry_middleware.TestRedisRetryMiddleware) ... ok
test_load_stats_codes (test_redis_stats_middleware.TestRedisStatsMiddleware) ... ok
test_rsm_input (test_redis_stats_middleware.TestRedisStatsMiddleware) ... ok
test_link_spider_parse (test_wandering_spider.TestWanderingSpider) ... ok

----------------------------------------------------------------------
Ran 21 tests in 4.665s

OK
/usr/src/app/crawling/spiders/link_spider.py:6: ScrapyDeprecationWarning: Module `scrapy.conf` is deprecated, use `crawler.settings` attribute instead
  from scrapy.conf import settings
test_crawler_process (__main__.TestLinkSpider) ... /usr/src/app/crawling/log_retry_middleware.py:10: ScrapyDeprecationWarning: Importing from scrapy.xlib.tx is deprecated and will no longer be supported in future Scrapy versions. Update your code to import from twisted proper.
  from scrapy.xlib.tx import ResponseFailed
2018-01-16 22:30:13,058 [sc-crawler] INFO: Changed Public IP: None -> b'87.4.65.220'
ERROR

======================================================================
ERROR: test_crawler_process (__main__.TestLinkSpider)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/online.py", line 92, in test_crawler_process
    m = next(self.consumer)
  File "/usr/local/lib/python2.7/site-packages/future/builtins/newnext.py", line 65, in newnext
    raise e
StopIteration

----------------------------------------------------------------------
Ran 1 test in 36.955s

FAILED (errors=1)
integration tests failed

Thanks!

Top GitHub Comments

madisonb commented, Jan 19, 2018

Actually, I may have found the issue. I will be releasing a 1.2.1 hotfix, hopefully today. In the meantime, use a different website to execute the crawl, like http://dmoztools.net. Your crawler should be working fine, but the new IST Research website appears to cause issues with JavaScript inside of the scraper.

All of the integration tests passed here when I changed the URLs at this commit.
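
For anyone reading the traceback above: a bare StopIteration from `next(self.consumer)` means the iterator ended without yielding a message, which is consistent with the crawl producing no result before the test's consumer stopped waiting. The sketch below only illustrates that Python behavior with a plain generator. `crawl_results` and `first_result` are hypothetical names for illustration, not part of Scrapy Cluster, and the real consumer in tests/online.py is not reproduced here.

```python
def crawl_results(messages):
    """Hypothetical stand-in for the test's consumer.

    Yields whatever messages arrived, then ends, the way an iterator
    that stops waiting would.
    """
    for message in messages:
        yield message


def first_result(consumer):
    """Return the first message, or None if the consumer yielded nothing."""
    # Passing a default to next() avoids the bare StopIteration.
    return next(consumer, None)


# Case 1: the crawl produced nothing, so next() raises StopIteration,
# which is the error shown in the traceback.
empty_consumer = crawl_results([])
try:
    next(empty_consumer)
    raised = False
except StopIteration:
    raised = True

# Case 2: the crawl produced a result, so the first message comes back.
found = first_result(crawl_results([{"url": "http://dmoztools.net"}]))
```

With a default passed to `next`, a test can fail with a clear "no crawl result received" message instead of an unexplained StopIteration.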

edolix commented, Jan 19, 2018

Thanks @madisonb !!
