[Bug] Sync execution failed on a GCP cluster
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Clusters, RLlib
What happened + What you expected to happen
While running this example on a GCP cluster with one head node and two worker nodes, the following error appears: googleapiclient.errors.HttpError: <HttpError 403 when requesting https://cloudresourcemanager.googleapis.com/v1/projects/<my-project-id>:getIamPolicy?alt=json returned "The caller does not have permission">
(run pid=939) 2022-01-19 03:04:35,849 INFO commands.py:293 -- Checking GCP environment settings
(run pid=939) 2022-01-19 03:04:35,857 WARN util.py:137 -- The `head_node` field is deprecated and will be ignored. Use `head_node_type` and `available_node_types` instead.
(run pid=939) 2022-01-19 03:04:36,533 ERROR syncer.py:254 -- Sync execution failed.
(run pid=939) Traceback (most recent call last):
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 251, in sync_down
(run pid=939) self._remote_path, self._local_dir, exclude=exclude)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/integration/docker.py", line 140, in sync_down
(run pid=939) use_internal_ip=True)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/sdk.py", line 150, in rsync
(run pid=939) should_bootstrap=should_bootstrap)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 1079, in rsync
(run pid=939) config = _bootstrap_config(config, no_config_cache=no_config_cache)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 316, in _bootstrap_config
(run pid=939) resolved_config = provider_cls.bootstrap_config(config)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 187, in bootstrap_config
(run pid=939) return bootstrap_gcp(cluster_config)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/gcp/config.py", line 272, in bootstrap_gcp
(run pid=939) config = _configure_iam_role(config, crm, iam)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/gcp/config.py", line 342, in _configure_iam_role
(run pid=939) _add_iam_policy_binding(service_account, roles, crm)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/gcp/config.py", line 582, in _add_iam_policy_binding
(run pid=939) resource=project_id, body={}).execute()
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
(run pid=939) return wrapped(*args, **kwargs)
(run pid=939) File "/home/ray/anaconda3/lib/python3.7/site-packages/googleapiclient/http.py", line 851, in execute
(run pid=939) raise HttpError(resp, content, uri=self.uri)
(run pid=939) googleapiclient.errors.HttpError: <HttpError 403 when requesting https://cloudresourcemanager.googleapis.com/v1/projects/<my-project-id>:getIamPolicy?alt=json returned "The caller does not have permission">
(run pid=939) 2022-01-19 03:04:36,533 WARNING util.py:166 -- The `callbacks.on_trial_result` operation took 0.756 s, which may be a performance bottleneck.
(run pid=939) 2022-01-19 03:04:36,570 WARNING util.py:166 -- The `process_trial_result` operation took 0.793 s, which may be a performance bottleneck.
(run pid=939) 2022-01-19 03:04:36,570 WARNING util.py:166 -- Processing trial results took 0.793 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.
(run pid=939) 2022-01-19 03:04:36,570 WARNING util.py:166 -- The `process_trial` operation took 0.795 s, which may be a performance bottleneck.
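For anyone debugging this, the failing request can be reproduced outside of Ray with a few lines of Python. This is only a diagnostic sketch (not part of the original report); it issues the same cloudresourcemanager getIamPolicy call that _add_iam_policy_binding() makes in the traceback above, using whatever default credentials the node has. A 403 here confirms that the node's credentials cannot read the project's IAM policy. Replace the <my-project-id> placeholder with the real project id.

import googleapiclient.discovery
from googleapiclient.errors import HttpError

PROJECT_ID = "<my-project-id>"  # placeholder, same as in the log above

# Build a Cloud Resource Manager client with the node's default credentials
# and issue the same getIamPolicy request that bootstrap_gcp() makes.
crm = googleapiclient.discovery.build("cloudresourcemanager", "v1")
try:
    policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()
    print("getIamPolicy succeeded; bindings:", len(policy.get("bindings", [])))
except HttpError as err:
    # A 403 here reproduces the failure: these credentials are not allowed
    # to read (or modify) the project's IAM policy.
    print("getIamPolicy failed:", err)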
Versions / Dependencies
- Ray 1.9.2
- RLlib 1.9.2
- Python 3.7.12
- Torch 1.10.1
- Debian GNU/Linux 10 (buster)
Reproduction script
Cluster config (t4-cluster.yaml):
# A unique identifier for the head node and workers of this cluster.
cluster_name: gpu-docker-t4

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 36

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 8.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu"
    # image: rayproject/ray:latest-gpu # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_nvidia_docker" # e.g. ray_docker
    # # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # worker_image: "rayproject/ray-ml:latest"

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    availability_zone: us-central1-a
    project_id: <my-project-id>
    type: gcp
    region: us-central1

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_gpu:
        # The resources provided by this node type.
        resources: {"CPU": 6, "GPU": 1}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 500
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu110
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-t4
                acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            scheduling:
              - onHostMaintenance: TERMINATE
    ray_worker_gpu:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 2
        # The maximum number of workers nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 36
        # The resources provided by this node type.
        resources: {"CPU": 2, "GPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 500
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu110
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-t4
                acceleratorCount: 2
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            # Run workers on preemtible instance by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_gpu

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    # Wait until nvidia drivers are installed
    - >-
      timeout 300 bash -c "
          command -v nvidia-smi && nvidia-smi
          until [ \$? -eq 0 ]; do
              command -v nvidia-smi && nvidia-smi
          done"

# List of shell commands to run to set up nodes.
setup_commands:
    - conda create -y -n rllib python=3.8
    - conda activate rllib
    - pip install ray[rllib] tensorflow torch

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
        ulimit -n 65536;
        ray start
        --head
        --port=6379
        --object-manager-port=8076
        --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
        ulimit -n 65536;
        ray start
        --address=$RAY_HEAD_IP:6379
        --object-manager-port=8076

head_node: {}
worker_nodes: {}
$ ray up -y t4-cluster.yaml
$ ray submit t4-cluster.yaml cartpole_lstm.py --run "IMPALA" --framework "torch"
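Not part of the original report, but since the 403 points at the credentials available on the head node, it can also help to check which service account the head VM (and therefore the Ray container) is running as. A minimal sketch using the standard GCE metadata endpoint:

import urllib.request

# Ask the GCE metadata server which service account this VM runs as.
req = urllib.request.Request(
    "http://metadata.google.internal/computeMetadata/v1/instance/"
    "service-accounts/default/email",
    headers={"Metadata-Flavor": "Google"},
)
print(urllib.request.urlopen(req).read().decode())

If the printed account lacks permission to read the project's IAM policy, the getIamPolicy call made during bootstrap will fail with the same 403 shown in the log.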
Anything else
This occurs every time.
There is another bug where the same call fails: https://github.com/ray-project/ray/issues/19875
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Top GitHub Comments
@ijrsvt I can confirm that the fix in #21907 solves this issue:
@ijrsvt I need to test again with your changes. I will verify the fix in the morning