Docker Quickstart Error for Crawler container

Hi guys, Thanks for this amazing project!

I’m having trouble setting up a dockerized cluster. I am following the Docker quickstart step by step, but the online test fails, and only in the scrapycluster_crawler_1 container.

My local setup:

MacBook Pro - High Sierra
Docker4Mac Version 17.12.0-ce-mac47 (21805)

Full console output below:

root@e5bea50cbe71:/usr/src/app# ./run_docker_tests.sh
/usr/src/app/crawling/distributed_scheduler.py:8: ScrapyDeprecationWarning: Module `scrapy.conf` is deprecated, use `crawler.settings` attribute instead
  from scrapy.conf import settings
test_change_config (test_distributed_scheduler.TestDistributedSchedulerChangeConfig) ... ok
test_create_queues (test_distributed_scheduler.TestDistributedSchedulerCreateQueues) ... ok
test_enqueue_request (test_distributed_scheduler.TestDistributedSchedulerEnqueueRequest) ... /usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:306: SystemTimeWarning: System time is way off (before 2016-01-01). This will probably lead to SSL verification errors
  SystemTimeWarning
ok
test_error_config (test_distributed_scheduler.TestDistributedSchedulerErrorConfig) ... ok
test_expire_queues (test_distributed_scheduler.TestDistributedSchedulerExpireQueues) ... ok
test_find_item (test_distributed_scheduler.TestDistributedSchedulerFindItem) ... ok
test_fit_scale (test_distributed_scheduler.TestDistributedSchedulerFitScale) ... ok
test_load_domain_config (test_distributed_scheduler.TestDistributedSchedulerLoadDomainConfig) ... ok
test_next_request (test_distributed_scheduler.TestDistributedSchedulerNextRequest) ... ok
test_parse_cookie (test_distributed_scheduler.TestDistributedSchedulerParseCookie) ... ok
test_update_domain_queues (test_distributed_scheduler.TestDistributedSchedulerUpdateDomainQueues) ... ok
test_link_spider_parse (test_link_spider.TestLinkSpider) ... ok
/usr/src/app/crawling/log_retry_middleware.py:10: ScrapyDeprecationWarning: Importing from scrapy.xlib.tx is deprecated and will no longer be supported in future Scrapy versions. Update your code to import from twisted proper.
  from scrapy.xlib.tx import ResponseFailed
test_lrm_stats_setup (test_log_retry_middleware.TestLogRetryMiddlewareStats) ... ok
test_mpm_middleware (test_meta_passthrough_middleware.TestMetaPassthroughMiddleware) ... ok
test_process_item (test_pipelines.TestKafkaPipeline) ... ok
test_process_item (test_pipelines.TestLoggingBeforePipeline) ... ok
test_dupe_filter (test_redis_dupefilter.TestRedisDupefilter) ... ok
test_retries (test_redis_retry_middleware.TestRedisRetryMiddleware) ... ok
test_load_stats_codes (test_redis_stats_middleware.TestRedisStatsMiddleware) ... ok
test_rsm_input (test_redis_stats_middleware.TestRedisStatsMiddleware) ... ok
test_link_spider_parse (test_wandering_spider.TestWanderingSpider) ... ok

----------------------------------------------------------------------
Ran 21 tests in 4.665s

OK
/usr/src/app/crawling/spiders/link_spider.py:6: ScrapyDeprecationWarning: Module `scrapy.conf` is deprecated, use `crawler.settings` attribute instead
  from scrapy.conf import settings
test_crawler_process (__main__.TestLinkSpider) ... /usr/src/app/crawling/log_retry_middleware.py:10: ScrapyDeprecationWarning: Importing from scrapy.xlib.tx is deprecated and will no longer be supported in future Scrapy versions. Update your code to import from twisted proper.
  from scrapy.xlib.tx import ResponseFailed
2018-01-16 22:30:13,058 [sc-crawler] INFO: Changed Public IP: None -> b'87.4.65.220'
ERROR

======================================================================
ERROR: test_crawler_process (__main__.TestLinkSpider)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/online.py", line 92, in test_crawler_process
    m = next(self.consumer)
  File "/usr/local/lib/python2.7/site-packages/future/builtins/newnext.py", line 65, in newnext
    raise e
StopIteration

----------------------------------------------------------------------
Ran 1 test in 36.955s

FAILED (errors=1)
integration tests failed

Thanks!

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 6

Top GitHub Comments

madisonb commented, Jan 19, 2018

Actually, I may have found the issue, and I hope to release a 1.2.1 hotfix today. Use a different website to execute the crawl, such as http://dmoztools.net. Your crawler should be working fine, but the new IST Research website appears to cause issues with JavaScript inside the scraper.

All of the integration tests passed here when I changed the URLs at this commit.

edolix commented, Jan 19, 2018

Thanks @madisonb !!
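
A note on the traceback in the original report: the bare StopIteration comes from calling next() on a consumer that yielded nothing. One plausible reading, assuming the test's consumer is an iterator that simply stops when no message arrives within its timeout (kafka-python's KafkaConsumer behaves this way when consumer_timeout_ms is set), is that the crawl never produced a result to consume. The sketch below uses a hypothetical stand-in iterator rather than a real Kafka consumer, and the helper name first_message is invented for illustration. It shows how passing a default to next() turns the opaque StopIteration into a readable failure.

```python
def empty_consumer():
    """Hypothetical stand-in for a consumer that timed out with no messages."""
    return iter(())


def first_message(consumer):
    """Return the first message, or fail with a descriptive error if none arrived."""
    sentinel = object()
    # next() with a default returns the sentinel instead of raising StopIteration.
    message = next(consumer, sentinel)
    if message is sentinel:
        raise AssertionError("no crawl result arrived before the consumer stopped")
    return message


# A bare next() on an exhausted iterator reproduces the error from the report.
try:
    next(empty_consumer())
    outcome = "message received"
except StopIteration:
    outcome = "StopIteration"
```

With this pattern, a test that receives no crawl result fails with a message that points at the missing result instead of at the iterator protocol.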
