[Bug] [Train] Cannot reproduce fault-tolerance, script hangs upon any node shutdown
See original GitHub issue
Ray Component
Ray Train
What happened + What you expected to happen
I ran the official PyTorch Train + Torch DDP example on a 2-node GCP cluster (each node with a single K80).
- Training works, but whenever I kill one of the nodes to test fault tolerance, the script hangs indefinitely waiting for another node.
- Note that the cluster launcher successfully starts a replacement node (I have min_workers: 2 in the cluster config), but the script still hangs.
- I also tried keeping an additional spare node available (min_workers: 3), but the script still hangs on any node failure.
- I suspect the problem is in matching Train's resource requirements with the available nodes.
- Sometimes killing a single node makes the whole training fail (not sure why).
Versions / Dependencies
Latest Ray master & PyTorch 1.10
Reproduction script
This is the cluster YAML (a sketch of how the training was launched follows after the config):
cluster_name: my-cluster
upscaling_speed: 10.0
idle_timeout_minutes: 50
min_workers: 4
max_workers: 40
provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-a
    project_id: my-proj
auth:
    ssh_user: user
available_node_types:
    ray_head_default:
        min_workers: 0
        max_workers: 0
        resources: {"CPU": 0}
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                    diskSizeGb: 100
                    sourceImage: projects/path/to/myimage
            scheduling:
              - provisioningModel: SPOT
              - onHostMaintenance: TERMINATE
    ray_worker_small:
        min_workers: 2
        resources: {"CPU": 8, "GPU": 1.0}
        node_config:
            machineType: n1-standard-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                    diskSizeGb: 100
                    sourceImage: projects/path/to/myimage
            guestAccelerators:
              - acceleratorType: nvidia-tesla-k80
                acceleratorCount: 1
            scheduling:
              - provisioningModel: SPOT
              - onHostMaintenance: TERMINATE
            networkInterfaces:
              - accessConfigs:
                  - name: "External NAT"
                nicType: GVNIC
                subnetwork: "projects/my-proj/regions/us-central1/subnetworks/default"
setup_commands:
    - pip uninstall -y ray
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
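The original training script is not included in the report. Below is a minimal, assumed sketch of how the official Train + Torch DDP example would typically be launched against this cluster with the Trainer API shipped in the 2.0.0.dev0 wheels referenced above; the module my_ddp_example and its train_func are placeholders, not the reporter's actual code.

import ray
from ray.train import Trainer

# Placeholder import: stands in for the training function of the official
# PyTorch Train + Torch DDP example mentioned in the report.
from my_ddp_example import train_func

ray.init(address="auto")  # attach to the running GCP cluster started from the YAML above

# Two GPU workers, one per K80 worker node (ray_worker_small has min_workers: 2).
trainer = Trainer(backend="torch", num_workers=2, use_gpu=True)
trainer.start()
results = trainer.run(train_func)
trainer.shutdown()

Killing one of the worker VMs while trainer.run is in flight is what triggers the hang described above.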
Top GitHub Comments
Hey @ASDen, I am able to reproduce this locally. What's happening is that the default PyTorch communication timeout is 30 minutes, so training hangs on gradient synchronization instead of raising an error, and therefore Ray Train's fault tolerance is never triggered.
The fix for you should just be to specify a lower timeout when you create your Trainer: Trainer(backend=TorchConfig(timeout_s=10), ...). We will make the default timeout lower in the next release (https://github.com/ray-project/ray/pull/22511). Thanks for bringing this issue up!
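For context, here is a minimal sketch of how that suggested fix plugs into the Trainer constructor; the worker count and GPU flag are illustrative, not part of the maintainer's reply.

from ray.train import Trainer
from ray.train.torch import TorchConfig

# Lowering the process-group timeout makes a dead peer raise an error within
# seconds instead of hanging on gradient synchronization for the default
# 30 minutes, so Ray Train's fault-tolerance handling can actually kick in.
trainer = Trainer(
    backend=TorchConfig(timeout_s=10),  # default was 1800 s (30 minutes) at the time
    num_workers=2,                      # illustrative; match your cluster
    use_gpu=True,
)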
@ASDen can you run ps aux to find the PID of the process running your Python script, and then run py-spy dump --pid {MY_SCRIPT_PID}? And is that the only output from ray stack? I believe there should be more output. Was it run on every node?