
[Bug] [Train] Cannot reproduce fault-tolerance, script hangs upon any node shutdown

See original GitHub issue

Ray Component

Ray Train

What happened + What you expected to happen

I just ran the official PyTorch Train + Torch DDP example here on a 2-node cluster on GCP (each node with a single K80)

  • It works, but whenever I kill one of the nodes to test fault tolerance, the script hangs indefinitely, waiting for another node
  • Note that the cluster manager successfully launches a new node (I have min_workers of 2 in the cluster config), but the script still hangs
  • I tried keeping an additional node available (min_workers=3), but the script still hangs upon any node failure
  • I suspect the problem is in matching Train's resource requirements against the available nodes
  • Sometimes killing a single node makes the whole training fail (not sure why)

Versions / Dependencies

Latest master & PyTorch 1.10

Reproduction script

This is the cluster YAML (a sketch of the training-script side follows after it):

cluster_name: my-cluster

upscaling_speed: 10.0
idle_timeout_minutes: 50

min_workers: 4
max_workers: 40

provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-a
    project_id: my-proj

auth:
    ssh_user: user

available_node_types:
    ray_head_default:
        min_workers: 0
        max_workers: 0
        resources: {"CPU": 0}
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  sourceImage: projects/path/to/myimage
            scheduling:
              - provisioningModel: SPOT
              - onHostMaintenance: TERMINATE
            
    ray_worker_small:
        min_workers: 2
        resources: {"CPU": 8, "GPU": 1.0}
        node_config:
            machineType: n1-standard-8
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  sourceImage: projects/path/to/myimage

            guestAccelerators:
              - acceleratorType: nvidia-tesla-k80
                acceleratorCount: 1
            scheduling:
              - provisioningModel: SPOT
              - onHostMaintenance: TERMINATE
            
            networkInterfaces:
              - accessConfigs:
                - name: "External NAT"
                nicType: GVNIC
                subnetwork: "projects/my-proj/regions/us-central1/subnetworks/default"

setup_commands:
    - pip uninstall -y ray
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
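
For completeness, a rough sketch of the training-script side that runs on this cluster. This is not the exact example linked above; it assumes the legacy ray.train Trainer API that was on master at the time (import paths included), and the model and data are toy placeholders.

import torch
import torch.nn as nn
from ray.train import Trainer
from ray.train.torch import prepare_model


def train_func(config):
    # Toy stand-ins; the real run used the official Train + Torch DDP example.
    model = prepare_model(nn.Linear(10, 1))  # wraps the module in DistributedDataParallel
    device = next(model.parameters()).device
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(config.get("epochs", 3)):
        x = torch.randn(32, 10, device=device)
        y = torch.randn(32, 1, device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()  # gradient synchronization happens here; this is where the hang shows up
        optimizer.step()


trainer = Trainer(backend="torch", num_workers=2, use_gpu=True)
trainer.start()
trainer.run(train_func, config={"epochs": 3})
trainer.shutdown()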

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
amogkam commented on Feb 19, 2022

Hey @ASDen, I am able to reproduce this locally. What’s happening is that the default PyTorch communication timeout is 30 minutes, so training hangs on gradient synchronization instead of raising an error, and therefore Ray Train’s fault tolerance is never triggered.
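
(For context, the 30-minute default mentioned here is PyTorch's process-group timeout. The snippet below is purely illustrative of where that knob lives in plain PyTorch; Ray Train calls init_process_group on each worker for you, which is presumably what TorchConfig's timeout_s controls.)

# Illustration only: in plain PyTorch, the communication timeout is set on the
# process group. A dead peer makes collectives block until this timeout expires
# before any error surfaces, hence the ~30 minute hang with the default value.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",                       # "nccl" in the GPU case
    init_method="tcp://127.0.0.1:29500",  # arbitrary local address, just for this sketch
    rank=0,
    world_size=1,                         # single process so the sketch runs standalone
    timeout=timedelta(minutes=30),        # PyTorch's default communication timeout
)
dist.destroy_process_group()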

The fix for you should just be to specify a lower timeout when you create your Trainer: Trainer(backend=TorchConfig(timeout_s=10), ...). We will make the default timeout lower in the next release (https://github.com/ray-project/ray/pull/22511). Thanks for bringing this issue up!
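
In code, the suggested change looks roughly like this (a sketch against the legacy ray.train API from this era, with import paths assumed; the 10-second value and train_func are placeholders):

from ray.train import Trainer
from ray.train.torch import TorchConfig


def train_func():
    ...  # your existing Torch DDP training function


# Pass a TorchConfig with a low timeout_s instead of backend="torch", so a lost
# worker surfaces as an error quickly and Train's fault tolerance can kick in.
trainer = Trainer(
    backend=TorchConfig(timeout_s=10),
    num_workers=2,
    use_gpu=True,
)
trainer.start()
trainer.run(train_func)
trainer.shutdown()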

0 reactions
amogkam commented on Feb 23, 2022

@ASDen can you run ps aux to find the PID of the process running your Python script, and then run py-spy dump --pid {MY_SCRIPT_PID}?

And is that the only output from ray stack? I believe there should be more output. Was this run on every node?

Read more comments on GitHub >

