permissions on rsync'd files are incorrect on worker nodes, results in inability to update workers
What is the problem?
I can’t get a cluster to scale up after launching and using existing nodes. The files fail to sync.
# ray --version
ray, version 1.0.1.post1
# python --version
Python 3.7.7
Reproduction (REQUIRED)
- start a cluster with 2 nodes
- scale to more, say, 4
- when the head tries to push files via the rsync command runner, some of the files on the worker node are owned by root instead of ubuntu; this results in an rsync error
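One way to confirm the ownership mismatch described in the last step is to walk the sync target on a worker and list everything not owned by the SSH user. The following is a minimal sketch, not part of Ray: the helper name `files_not_owned_by` is hypothetical, and the directory `/tmp/ray_tmp_mount` and user `ubuntu` are taken from the report and may differ on other clusters.

```python
import os
import pwd


def files_not_owned_by(root_dir, expected_user):
    """Return sorted (path, owner) pairs under root_dir not owned by expected_user."""
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            uid = os.lstat(path).st_uid  # lstat: do not follow symlinks
            try:
                owner = pwd.getpwuid(uid).pw_name
            except KeyError:
                # No passwd entry for this uid; report the numeric id instead.
                owner = str(uid)
            if owner != expected_user:
                mismatched.append((path, owner))
    return sorted(mismatched)


if __name__ == "__main__":
    # Directory and user are assumptions based on the report above.
    for path, owner in files_not_owned_by("/tmp/ray_tmp_mount", "ubuntu"):
        print(f"{owner}\t{path}")
```

Run on a worker node, any line starting with root would point at an entry that the ubuntu user may be unable to overwrite during the next sync.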
# rsync command with -vvv output on
rsync --rsh "ssh -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/f786baef9d/%C -o ControlPersist=10s -o ConnectTimeout=120s" -avz --omit-dir-times --exclude **/.git --exclude **/.git/** --filter "dir-merge,- .gitignore" /project/ ubuntu@172.31.6.63:/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/project/ -vvv
# ...... output last few lines below.....
recv_files(nodes.py)
rsync: mkstemp "/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/nodes.py.S7lMrv" failed: Permission denied (13)
chunk[0] of size 700 at 0 offset=0
chunk[1] of size 700 at 700 offset=700
chunk[2] of size 700 at 1400 offset=1400
chunk[3] of size 700 at 2100 offset=2100
chunk[4] of size 700 at 2800 offset=2800
chunk[5] of size 60 at 3500 offset=3500
got file_sum
recv_files(src/pipeline.py)
rsync: mkstemp "/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/pipeline.py.hDutCb" failed: Permission denied (13)
chunk[0] of size 700 at 0 offset=0
rsync: connection unexpectedly closed (318054 bytes received so far) [sender]
[sender] _exit_cleanup(code=12, file=io.c, line=235): entered
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [sender=3.1.2]
[sender] _exit_cleanup(code=12, file=io.c, line=235): about to call exit(12)
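The verbose output above is long, so it can help to pull out only the "Permission denied" lines. The sketch below assumes the log has been saved as text and that failing lines follow the format shown in this report; the function names are hypothetical. Because rsync creates its temporary file (mkstemp) in the destination directory, the parent directory of each reported path is the one the SSH user could not write to.

```python
import os
import re

# Matches rsync lines such as:
#   rsync: mkstemp "/path/file.XXXXXX" failed: Permission denied (13)
_DENIED = re.compile(r'^rsync: (\w+) "([^"]+)" failed: Permission denied \(13\)')


def permission_denied_paths(rsync_output):
    """Return (operation, path) pairs for each 'Permission denied' line."""
    hits = []
    for line in rsync_output.splitlines():
        match = _DENIED.match(line.strip())
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits


def unwritable_dirs(rsync_output):
    """Return the sorted, de-duplicated parent directories of the failing paths."""
    return sorted({os.path.dirname(path) for _, path in permission_denied_paths(rsync_output)})
```

Feeding it the output above would report the single mount directory under /tmp/ray_tmp_mount as the place to inspect with ls -la.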
YAML to reproduce (minus the actual code repo, which I can’t share, but shouldn’t matter):
min_workers: 2
max_workers: 2
docker:
    image: anyscale/ray-ml:latest-gpu
    container_name: ray_container
    pull_before_run: False
head_node:
    InstanceType: p2.xlarge
    IamInstanceProfile:
        Arn: '<ARN HERE>'
worker_nodes:
    InstanceType: p2.xlarge
    IamInstanceProfile:
        Arn: '<ARN HERE>'
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
setup_commands:
    - >-
      git clone git@bitbucket.org/org/project.git /project || true;
target_utilization_fraction: 0.8
head_start_ray_commands:
    - >-
      ray stop;
      ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml;
worker_start_ray_commands:
    - >-
      ray stop;
      ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076;
provider:
    type: aws
    region: ap-southeast-2
    cache_stopped_nodes: true
auth:
    ssh_user: ubuntu
metadata:
    anyscale:
        working_dir: /project
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Issue Analytics
- Created 3 years ago
- Comments: 6 (5 by maintainers)

Some more context: if we start a cluster with 4 nodes (without running any Ray applications) and wait for them to start, we get the following situation:

The nodes fail around (from monitor.out):

and monitor.err has the following repeated many times:

This issue is most likely caused by rsync being able to copy the files but then failing to update their times, because the files in question are owned by root, not by ubuntu. This is supported by the following ls -la /tmp/ray_tmp_mount/project_folder on the worker node (note that the errors only appear when the file is owned by root).

Any updates on this issue? This still happens for me with rsync_up, e.g., with the following error:

I haven't really found a workaround other than running a command on all worker VMs that changes the owner of the generated files. This is super hacky though.
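For reference, the ownership-reset workaround mentioned in the last comment could look roughly like the sketch below. It is not an official Ray fix. It assumes the worker IPs are known, that the ubuntu user has passwordless sudo on the workers, and that the key path and mount directory match the ones in the rsync command from the report; `build_chown_command` and `fix_workers` are hypothetical helper names.

```python
import os
import shlex
import subprocess


def build_chown_command(worker_ip, ssh_user="ubuntu",
                        key_path="~/ray_bootstrap_key.pem",
                        target_dir="/tmp/ray_tmp_mount"):
    """Build an ssh command that hands target_dir back to ssh_user on one worker."""
    # Assumption: ssh_user may run sudo without a password on the worker.
    remote = f"sudo chown -R {shlex.quote(ssh_user)} {shlex.quote(target_dir)}"
    return [
        "ssh",
        "-i", os.path.expanduser(key_path),
        "-o", "StrictHostKeyChecking=no",
        f"{ssh_user}@{worker_ip}",
        remote,
    ]


def fix_workers(worker_ips):
    """Run the chown command on every worker, raising if any of them fails."""
    for ip in worker_ips:
        subprocess.run(build_chown_command(ip), check=True)
```

Calling fix_workers with the list of worker IPs before the next sync would reset ownership, but it would not stop the files from being created as root again, so it only treats the symptom.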