
[Bug] [Train] Cannot reproduce fault-tolerance, script hangs upon any node shutdown

See original GitHub issue

Ray Component

Ray Train

What happened + What you expected to happen

I just ran the official PyTorch Train + Torch DDP example here on a 2-node cluster on GCP (each node with a single K80)

  • It works, but whenever I kill one of the nodes to test fault tolerance, the script hangs indefinitely, waiting for another node
  • Note that the cluster manager successfully launches a new node (I have min_workers of 2 in the cluster config), but the script still hangs
  • I tried keeping an additional node available (min_workers=3), but the script still hangs upon any node failure
  • I suspect the problem is in matching Train's resource requirements against the available nodes
  • Sometimes killing a single node makes the whole training fail (not sure why)

Versions / Dependencies

Latest master & PyTorch 1.10

Reproduction script

This is the cluster YAML (a sketch of the training-script side follows after it):

cluster_name: my-cluster

upscaling_speed: 10.0
idle_timeout_minutes: 50

min_workers: 4
max_workers: 40

provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-a
    project_id: my-proj

auth:
    ssh_user: user

available_node_types:
    ray_head_default:
        min_workers: 0
        max_workers: 0
        resources: {"CPU": 0}
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  sourceImage: projects/path/to/myimage
            scheduling:
              - provisioningModel: SPOT
              - onHostMaintenance: TERMINATE
            
    ray_worker_small:
        min_workers: 2
        resources: {"CPU": 8, "GPU": 1.0}
        node_config:
            machineType: n1-standard-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  sourceImage: projects/path/to/myimage

            guestAccelerators:
              - acceleratorType: nvidia-tesla-k80
                acceleratorCount: 1
            scheduling:
              - provisioningModel: SPOT
              - onHostMaintenance: TERMINATE
            
            networkInterfaces:
              - accessConfigs:
                - name: "External NAT"
                nicType: GVNIC
                subnetwork: "projects/my-proj/regions/us-central1/subnetworks/default"

setup_commands:
    - pip uninstall -y ray
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
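
For completeness, a rough sketch of the training-script side that runs on this cluster. This is not the exact example linked above; it assumes the legacy ray.train Trainer API that was on master at the time (import paths included), and the model and data are toy placeholders.

import torch
import torch.nn as nn
from ray.train import Trainer
from ray.train.torch import prepare_model


def train_func(config):
    # Toy stand-ins; the real run used the official Train + Torch DDP example.
    model = prepare_model(nn.Linear(10, 1))  # wraps the module in DistributedDataParallel
    device = next(model.parameters()).device
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(config.get("epochs", 3)):
        x = torch.randn(32, 10, device=device)
        y = torch.randn(32, 1, device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()  # gradient synchronization happens here; this is where the hang shows up
        optimizer.step()


trainer = Trainer(backend="torch", num_workers=2, use_gpu=True)
trainer.start()
trainer.run(train_func, config={"epochs": 3})
trainer.shutdown()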

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
amogkam commented on Feb 19, 2022

Hey @ASDen, I am able to reproduce this locally. What’s happening is that the default PyTorch communication timeout is 30 minutes, so training hangs on gradient synchronization instead of raising an error, and therefore Ray Train’s fault tolerance is never triggered.
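
(For context, the 30-minute default mentioned here is PyTorch's process-group timeout. The snippet below is purely illustrative of where that knob lives in plain PyTorch; Ray Train calls init_process_group on each worker for you, which is presumably what TorchConfig's timeout_s controls.)

# Illustration only: in plain PyTorch, the communication timeout is set on the
# process group. A dead peer makes collectives block until this timeout expires
# before any error surfaces, hence the ~30 minute hang with the default value.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",                       # "nccl" in the GPU case
    init_method="tcp://127.0.0.1:29500",  # arbitrary local address, just for this sketch
    rank=0,
    world_size=1,                         # single process so the sketch runs standalone
    timeout=timedelta(minutes=30),        # PyTorch's default communication timeout
)
dist.destroy_process_group()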

The fix for you should just be to specify a lower timeout when you create your Trainer: Trainer(backend=TorchConfig(timeout_s=10), ...). We will make the default timeout lower in the next release (https://github.com/ray-project/ray/pull/22511). Thanks for bringing this issue up!
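
In code, the suggested change looks roughly like this (a sketch against the legacy ray.train API from this era, with import paths assumed; the 10-second value and train_func are placeholders):

from ray.train import Trainer
from ray.train.torch import TorchConfig


def train_func():
    ...  # your existing Torch DDP training function


# Pass a TorchConfig with a low timeout_s instead of backend="torch", so a lost
# worker surfaces as an error quickly and Train's fault tolerance can kick in.
trainer = Trainer(
    backend=TorchConfig(timeout_s=10),
    num_workers=2,
    use_gpu=True,
)
trainer.start()
trainer.run(train_func)
trainer.shutdown()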

0 reactions
amogkam commented on Feb 23, 2022

@ASDen can you run ps aux to find the PID of the process running your Python script, and then run py-spy dump --pid {MY_SCRIPT_PID}?

And is that the only output from ray stack? I believe there should be more output. Was this run on every node?

Read more comments on GitHub >

