Permissions on rsync'd files are incorrect on worker nodes, which makes it impossible to update workers


What is the problem?

I can’t get a cluster to scale up once it has launched and is using its existing nodes. File syncing to the worker nodes fails.

# ray --version
ray, version 1.0.1.post1
# python --version
Python 3.7.7

Reproduction (REQUIRED)

  1. Start a cluster with 2 nodes.
  2. Scale it up to more nodes, say 4.
  3. When the head node tries to push files via the rsync command runner, some of the files on the worker node are owned by root instead of ubuntu.
  4. This results in the rsync error below.
#  rsync command with -vvv output on
rsync --rsh "ssh -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/f786baef9d/%C -o ControlPersist=10s -o ConnectTimeout=120s" -avz --omit-dir-times --exclude **/.git --exclude **/.git/** --filter "dir-merge,- .gitignore" /project/ ubuntu@172.31.6.63:/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/project/ -vvv

# ... last few lines of output below ...
recv_files(nodes.py)
rsync: mkstemp "/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/nodes.py.S7lMrv" failed: Permission denied (13)
chunk[0] of size 700 at 0 offset=0
chunk[1] of size 700 at 700 offset=700
chunk[2] of size 700 at 1400 offset=1400
chunk[3] of size 700 at 2100 offset=2100
chunk[4] of size 700 at 2800 offset=2800
chunk[5] of size 60 at 3500 offset=3500
got file_sum
recv_files(src/pipeline.py)
rsync: mkstemp "/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/pipeline.py.hDutCb" failed: Permission denied (13)
chunk[0] of size 700 at 0 offset=0
rsync: connection unexpectedly closed (318054 bytes received so far) [sender]
[sender] _exit_cleanup(code=12, file=io.c, line=235): entered
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [sender=3.1.2]
[sender] _exit_cleanup(code=12, file=io.c, line=235): about to call exit(12)
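
To see at a glance which paths rsync could not write, the verbose output above can be filtered for its "Permission denied (13)" lines. The following is only a minimal sketch: the helper name `denied_paths` is hypothetical, and the regular expression is based solely on the two error lines shown in this log, so other rsync versions or error messages may need a different pattern.

```python
import re

# Matches lines shaped like the ones in the log above, e.g.
#   rsync: mkstemp "/some/path/file.S7lMrv" failed: Permission denied (13)
# The pattern is an assumption derived from this log only.
DENIED = re.compile(r'^rsync: (\w+) "([^"]+)" failed: Permission denied \(13\)')


def denied_paths(log_text):
    """Return (operation, path) pairs for every permission-denied line."""
    hits = []
    for line in log_text.splitlines():
        match = DENIED.match(line.strip())
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits
```

Feeding it the captured stderr of the rsync command gives the list of temporary files rsync failed to create, which points at the directories whose ownership is worth checking on the worker.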

YAML to reproduce (minus the actual code repo, which I can’t share, but shouldn’t matter):

min_workers: 2
max_workers: 2

docker:
    image: anyscale/ray-ml:latest-gpu
    container_name: ray_container
    pull_before_run: False

head_node:
    InstanceType: p2.xlarge
    IamInstanceProfile:
        Arn: '<ARN HERE>'
worker_nodes:
    InstanceType: p2.xlarge
    IamInstanceProfile:
        Arn: '<ARN HERE>'

rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"

setup_commands:
    - >-
      git clone git@bitbucket.org/org/project.git /project || true;

target_utilization_fraction: 0.8

head_start_ray_commands:
    - >-
      ray stop;
      ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml;

worker_start_ray_commands:
    - >-
      ray stop;
      ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076;

provider:
    type: aws
    region: ap-southeast-2
    cache_stopped_nodes: true

auth:
    ssh_user: ubuntu

metadata:
    anyscale:
        working_dir: /project
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
ijrsvt commented, Dec 17, 2020

Some more context: if we start a cluster with 4 nodes (without running any Ray applications) and wait for them to start, we end up in the following situation:

2020-12-17 21:18:30,211 INFO autoscaler.py:591 -- Cluster status: 4/4 target nodes (0 pending) (4 failed to update)

The nodes fail around this step (from monitor.out):

2020-12-17 03:58:35,877	INFO updater.py:225 -- [4/6] Processing worker file mounts

and monitor.err has the following repeated many times:

rsync: failed to set times on "/tmp/ray_tmp_mount/project_folder/.git/

The most likely cause is that rsync is able to copy the files but then fails to update their times, because the files in question are owned by root rather than ubuntu. This is supported by the following output of ls -la /tmp/ray_tmp_mount/project_folder on the worker node (note that the errors only appear for files owned by root).

drwxr-xr-x  8 root   root   4096 Dec 17 20:13 .git
-rw-r--r--  1 ubuntu ubuntu 2003 Dec 17 19:56 .gitignore
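
The same check that the ls -la listing does by eye can be done recursively. This is a rough sketch, not part of Ray: the function name `foreign_owned` is hypothetical, and it assumes it is run on the worker node itself by a user who can read the mount directory.

```python
import os


def foreign_owned(root, expected_uid):
    """Yield root and every path below it whose owner uid differs from expected_uid."""
    if os.lstat(root).st_uid != expected_uid:
        yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are checked themselves, not their targets
            if os.lstat(path).st_uid != expected_uid:
                yield path
```

On the worker, something like `foreign_owned("/tmp/ray_tmp_mount", pwd.getpwnam("ubuntu").pw_uid)` (with `import pwd`) would list everything not owned by ubuntu, such as the root-owned .git directory above.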
0 reactions
dirkweissenborn commented, May 13, 2022

Any updates on this issue? This still happens for me with rsync_up, e.g., with the following error:

chown: changing ownership of '/tmp/ray_tmp_mount/****/__pycache__/****.cpython-38.pyc': Operation not permitted

I haven’t really found a workaround other than running a command on all worker VMs that changes the owner of the generated files. This is super hacky, though.
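
For reference, the workaround described above could look roughly like the sketch below. It only builds the ssh command line and does not call any Ray API. Everything in it is an assumption: the helper name `chown_command` is hypothetical, the user has passwordless sudo on the workers, and the mount directory and key path are the ones that appear in the original report.

```python
import os
import shlex


def chown_command(worker_ip, user="ubuntu", mount="/tmp/ray_tmp_mount",
                  key="~/ray_bootstrap_key.pem"):
    """Build an ssh argv that recursively chowns the mount directory on one worker.

    Assumes `user` can run sudo without a password on the worker.
    """
    remote = "sudo chown -R {0}:{0} {1}".format(shlex.quote(user), shlex.quote(mount))
    return [
        "ssh",
        "-i", os.path.expanduser(key),
        "-o", "StrictHostKeyChecking=no",
        "{}@{}".format(user, worker_ip),
        remote,
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` for every worker IP before retrying the sync. This only repairs ownership after the fact; it does not address why the files end up owned by root in the first place.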
