[Bug] Sync execution failed on a GCP cluster
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Clusters, RLlib
What happened + What you expected to happen
While running this example on a GCP cluster with one head node and two worker nodes, the following error appears: googleapiclient.errors.HttpError: <HttpError 403 when requesting https://cloudresourcemanager.googleapis.com/v1/projects/<my-project-id>:getIamPolicy?alt=json returned "The caller does not have permission">
(run pid=939) 2022-01-19 03:04:35,849 INFO commands.py:293 -- Checking GCP environment settings
(run pid=939) 2022-01-19 03:04:35,857 WARN util.py:137 -- The `head_node` field is deprecated and will be ignored. Use `head_node_type` and `available_node_types` instead.
(run pid=939) 2022-01-19 03:04:36,533 ERROR syncer.py:254 -- Sync execution failed.
(run pid=939) Traceback (most recent call last):
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 251, in sync_down
(run pid=939) self._remote_path, self._local_dir, exclude=exclude)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/integration/docker.py", line 140, in sync_down
(run pid=939) use_internal_ip=True)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/sdk.py", line 150, in rsync
(run pid=939) should_bootstrap=should_bootstrap)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1079, in rsync
(run pid=939) config = _bootstrap_config(config, no_config_cache=no_config_cache)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 316, in _bootstrap_config
(run pid=939) resolved_config = provider_cls.bootstrap_config(config)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 187, in bootstrap_config
(run pid=939) return bootstrap_gcp(cluster_config)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/gcp/config.py", line 272, in bootstrap_gcp
(run pid=939) config = _configure_iam_role(config, crm, iam)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/gcp/config.py", line 342, in _configure_iam_role
(run pid=939) _add_iam_policy_binding(service_account, roles, crm)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/gcp/config.py", line 582, in _add_iam_policy_binding
(run pid=939) resource=project_id, body={}).execute()
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
(run pid=939) return wrapped(*args, **kwargs)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/googleapiclient/http.py", line 851, in execute
(run pid=939) raise HttpError(resp, content, uri=self.uri)
(run pid=939) googleapiclient.errors.HttpError: <HttpError 403 when requesting https://cloudresourcemanager.googleapis.com/v1/projects/<my-project-id>:getIamPolicy?alt=json returned "The caller does not have permission">
(run pid=939) 2022-01-19 03:04:36,533 WARNING util.py:166 -- The `callbacks.on_trial_result` operation took 0.756 s, which may be a performance bottleneck.
(run pid=939) 2022-01-19 03:04:36,570 WARNING util.py:166 -- The `process_trial_result` operation took 0.793 s, which may be a performance bottleneck.
(run pid=939) 2022-01-19 03:04:36,570 WARNING util.py:166 -- Processing trial results took 0.793 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.
(run pid=939) 2022-01-19 03:04:36,570 WARNING util.py:166 -- The `process_trial` operation took 0.795 s, which may be a performance bottleneck.
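For anyone debugging this, the failing request can be reproduced outside of Ray with a few lines of Python. This is only a diagnostic sketch (not part of the original report); it issues the same cloudresourcemanager getIamPolicy call that _add_iam_policy_binding() makes in the traceback above, using whatever default credentials the node has. A 403 here confirms that the node's credentials cannot read the project's IAM policy. Replace the <my-project-id> placeholder with the real project id.

import googleapiclient.discovery
from googleapiclient.errors import HttpError

PROJECT_ID = "<my-project-id>"  # placeholder, same as in the log above

# Build a Cloud Resource Manager client with the node's default credentials
# and issue the same getIamPolicy request that bootstrap_gcp() makes.
crm = googleapiclient.discovery.build("cloudresourcemanager", "v1")
try:
    policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()
    print("getIamPolicy succeeded; bindings:", len(policy.get("bindings", [])))
except HttpError as err:
    # A 403 here reproduces the failure: these credentials are not allowed
    # to read (or modify) the project's IAM policy.
    print("getIamPolicy failed:", err)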
Versions / Dependencies
- Ray 1.9.2
- RLlib 1.9.2
- Python 3.7.12
- Torch 1.10.1
- Debian GNU/Linux 10 (buster)
Reproduction script
Cluster config (t4-cluster.yaml):
# A unique identifier for the head node and workers of this cluster.
cluster_name: gpu-docker-t4

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 36

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 8.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu"
    # image: rayproject/ray:latest-gpu # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_nvidia_docker" # e.g. ray_docker
    # # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # worker_image: "rayproject/ray-ml:latest"

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    availability_zone: us-central1-a
    project_id: <my-project-id>
    type: gcp
    region: us-central1

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_gpu:
        # The resources provided by this node type.
        resources: {"CPU": 6, "GPU": 1}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 500
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu110
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-t4
                acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            scheduling:
              - onHostMaintenance: TERMINATE
    ray_worker_gpu:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 2
        # The maximum number of workers nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 36
        # The resources provided by this node type.
        resources: {"CPU": 2, "GPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 500
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu110
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-t4
                acceleratorCount: 2
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            # Run workers on preemtible instance by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_gpu

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    # Wait until nvidia drivers are installed
    - >-
      timeout 300 bash -c "
          command -v nvidia-smi && nvidia-smi
          until [ \$? -eq 0 ]; do
              command -v nvidia-smi && nvidia-smi
          done"

# List of shell commands to run to set up nodes.
setup_commands:
    - conda create -y -n rllib python=3.8
    - conda activate rllib
    - pip install ray[rllib] tensorflow torch

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
        ulimit -n 65536;
        ray start
        --head
        --port=6379
        --object-manager-port=8076
        --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
        ulimit -n 65536;
        ray start
        --address=$RAY_HEAD_IP:6379
        --object-manager-port=8076

head_node: {}
worker_nodes: {}
$ ray up -y t4-cluster.yaml
$ ray submit t4-cluster.yaml cartpole_lstm.py --run "IMPALA" --framework "torch"
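Not part of the original report, but since the 403 points at the credentials available on the head node, it can also help to check which service account the head VM (and therefore the Ray container) is running as. A minimal sketch using the standard GCE metadata endpoint:

import urllib.request

# Ask the GCE metadata server which service account this VM runs as.
req = urllib.request.Request(
    "http://metadata.google.internal/computeMetadata/v1/instance/"
    "service-accounts/default/email",
    headers={"Metadata-Flavor": "Google"},
)
print(urllib.request.urlopen(req).read().decode())

If the printed account lacks permission to read the project's IAM policy, the getIamPolicy call made during bootstrap will fail with the same 403 shown in the log.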
Anything else
This occurs every time.
There is another bug where the same call fails: https://github.com/ray-project/ray/issues/19875
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Top GitHub Comments
@ijrsvt I can confirm that the fix in #21907 solves this issue:
@ijrsvt I need to test again with your changes. I will verify the fix in the morning