[autoscaler] KeyError when starting private cluster
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- Ray installed from (source or binary): pip
- Ray version: 0.6.5
- Python version: 3.6.7
- Exact command to reproduce:
ray create-or-update cluster.yaml
Describe the problem
Source code / logs
I followed the documentation and modified example-full.yaml to fill in username, node IP addresses, and custom setup commands.
Traceback:
ray create-or-update cluster.yaml
/tmp/env/lib/python3.6/site-packages/ray/autoscaler/commands.py:38: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(open(config_file).read())
/tmp/env/lib/python3.6/site-packages/ray/autoscaler/node_provider.py:115: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  defaults = yaml.load(f)
2019-04-04 03:39:25,901 INFO node_provider.py:34 -- ClusterState: Loaded cluster state: {'c79.millennium.berkeley.edu': {'tags': {'ray-node-type': 'worker'}, 'state': 'terminated'}, 'c80.millennium.berkeley.edu': {'tags': {'ray-node-type': 'head', 'ray-launch-config': '6c51b8169c9469f0fa2568e5d238af2585d302a7', 'ray-node-name': 'ray-default-head'}, 'state': 'running'}}
2019-04-04 03:39:25,902 INFO node_provider.py:59 -- ClusterState: Writing cluster state: {'c79.millennium.berkeley.edu': {'tags': {'ray-node-type': 'worker'}, 'state': 'terminated'}, 'c80.millennium.berkeley.edu': {'tags': {'ray-node-type': 'head', 'ray-launch-config': '6c51b8169c9469f0fa2568e5d238af2585d302a7', 'ray-node-name': 'ray-default-head'}, 'state': 'running'}}
This will restart cluster services [y/N]: y
2019-04-04 03:39:29,888 INFO commands.py:202 -- get_or_create_head_node: Updating files on head node...
Traceback (most recent call last):
  File "/tmp/env/bin/ray", line 11, in <module>
    sys.exit(main())
  File "/tmp/env/lib/python3.6/site-packages/ray/scripts/scripts.py", line 766, in main
    return cli()
  File "/tmp/env/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/tmp/env/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/tmp/env/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/env/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmp/env/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/tmp/env/lib/python3.6/site-packages/ray/scripts/scripts.py", line 460, in create_or_update
    no_restart, restart_only, yes, cluster_name)
  File "/tmp/env/lib/python3.6/site-packages/ray/autoscaler/commands.py", line 47, in create_or_update_cluster
    override_cluster_name)
  File "/tmp/env/lib/python3.6/site-packages/ray/autoscaler/commands.py", line 243, in get_or_create_head_node
    initialization_commands=config["initialization_commands"],
KeyError: 'initialization_commands'
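The traceback ends at a direct dictionary lookup, config["initialization_commands"], so any cluster.yaml that omits that key would fail this way. Below is a minimal sketch of the failure and one possible workaround, using a plain dict as a stand-in for the parsed YAML. The helper fill_missing_keys and the OPTIONAL_DEFAULTS table are hypothetical names, not part of Ray, and the assumption that an empty list is an acceptable value has not been checked against this Ray version.

```python
import copy

# Hypothetical defaults for keys the autoscaler reads unconditionally.
# Only "initialization_commands" is confirmed by the traceback above.
OPTIONAL_DEFAULTS = {"initialization_commands": []}


def fill_missing_keys(config, defaults=OPTIONAL_DEFAULTS):
    """Return a copy of config with any missing default keys added."""
    patched = copy.deepcopy(config)
    for key, value in defaults.items():
        # setdefault leaves keys the user already set untouched.
        patched.setdefault(key, copy.deepcopy(value))
    return patched


# Stand-in for a parsed cluster.yaml without the key (placeholder values).
config = {"cluster_name": "default", "setup_commands": ["echo setup"]}

try:
    config["initialization_commands"]
except KeyError as err:
    missing_key = err.args[0]  # name of the key that was not found

patched = fill_missing_keys(config)
```

Adding the key by hand to cluster.yaml with an empty list should have the same effect as the helper, under the same assumption.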
Issue Analytics
- State:
- Created 4 years ago
- Reactions:1
- Comments:7 (2 by maintainers)
Top GitHub Comments
This particular assertion appears to be a separate problem, as I discovered the hard way. It seems to occur when the address of head_ip is also included in worker_ips. Removing the head_ip from the worker list eliminated the error for me. I also found it necessary to delete the tmp/cluster-<name>.state file from broken runs to prevent errors a few lines later, when it tries to add the missing head_ip to the worker_ips.
I’m getting the same error on the latest master. This line is causing the error.
Looks like the local example is out of date.
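The first comment attributes a separate assertion failure to head_ip also appearing in worker_ips. A small pre-flight check along these lines could catch that overlap before running ray create-or-update. This is only a sketch: the function name is hypothetical, and the provider layout with head_ip and worker_ips keys is assumed from the comment, not taken from Ray's source.

```python
def check_local_provider(provider):
    """Return a list of problems found in a local-cluster provider section."""
    problems = []
    head_ip = provider.get("head_ip")
    worker_ips = provider.get("worker_ips", [])
    if head_ip is None:
        problems.append("head_ip is not set")
    elif head_ip in worker_ips:
        # The overlap that the comment above reports as triggering the assertion.
        problems.append("head_ip %s is also listed in worker_ips" % head_ip)
    return problems


# Placeholder addresses, chosen only for illustration.
provider = {"head_ip": "10.0.0.1", "worker_ips": ["10.0.0.1", "10.0.0.2"]}
problems = check_local_provider(provider)
```

An empty list means the check found nothing wrong. A stale cluster state file from an earlier broken run, as mentioned in the same comment, would still need to be removed separately.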