Long-running distributed test fails
What is the problem?
The long-running distributed release test (pytorch_pbt_failure) fails after around 10 minutes with the following error:
```
2021-02-04 17:58:07,590 INFO commands.py:283 -- Checking AWS environment settings
2021-02-04 17:58:08,874 INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:58:09,027 INFO commands.py:441 -- Shutdown i-03aa1f3b86602ada0
2021-02-04 17:58:09,028 INFO command_runner.py:356 -- Fetched IP: 52.36.104.14
2021-02-04 17:58:09,028 INFO log_timer.py:27 -- NodeUpdater: i-03aa1f3b86602ada0: Got IP [LogTimer=0ms]
Warning: Permanently added '52.36.104.14' (ECDSA) to the list of known hosts.
Error: No such container: ray_container
Shared connection to 52.36.104.14 closed.
2021-02-04 17:59:20,400 WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 72.837 s, which may be a performance bottleneck.
Traceback (most recent call last):
  File "/home/ray/pytorch_pbt_failure.py", line 136, in <module>
    stop={"training_iteration": 1} if args.smoke_test else None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 360, in step
    iteration=self._iteration, trials=self._trials)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/callback.py", line 172, in on_step_begin
    callback.on_step_begin(**info)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/mock.py", line 122, in on_step_begin
    override_cluster_name=None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 460, in kill_node
    _exec(updater, "ray stop", False, False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 912, in _exec
    shutdown_after_run=shutdown_after_run)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 627, in run
    ssh_options_override_ssh_key=ssh_options_override_ssh_key)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 519, in run
    final_cmd, with_output, exit_on_fail, silent=silent)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 445, in _run_helper
    "Command failed:\n\n {}\n".format(joined_cmd)) from None
click.exceptions.ClickException: Command failed:

ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/3d9ed41da7/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@52.36.104.14 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'"'"'"'"'"'"'"'"''"'"' )'
```
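For context, the failing command ssh-es into the chosen node and runs `ray stop` inside the `ray_container` docker container; `Error: No such container: ray_container` is docker's response when no container by that name is running, e.g. because it was just killed and has not restarted yet. A minimal illustration of the failure mode and a hypothetical guard follows (the `container_running` helper and the skip-instead-of-fail behavior are assumptions for illustration, not Ray code):

```python
import subprocess

def container_running(name: str = "ray_container") -> bool:
    # `docker ps -q -f name=...` prints a container ID only if a matching
    # container is currently running; empty output means it is not.
    result = subprocess.run(
        ["docker", "ps", "-q", "-f", f"name={name}"],
        capture_output=True, text=True,
    )
    return bool(result.stdout.strip())

if container_running():
    # This is effectively what the autoscaler's command runner executes:
    subprocess.run(["docker", "exec", "ray_container", "ray", "stop"], check=True)
else:
    # Without a guard like this, `docker exec` exits non-zero with
    # "Error: No such container: ray_container", which the command runner
    # surfaces as the ClickException in the traceback above.
    print("ray_container not running; skipping 'ray stop'")
```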
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
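A minimal sketch of the failing pattern, based on the traceback above: `tune.run` driven by the failure-injecting callback from `ray/tune/utils/mock.py` (assumed here to be `FailureInjectorCallback`). The trainable is a placeholder; the real test trains PyTorch models under PBT on a multi-node cluster, so this is not self-contained:

```python
import ray
from ray import tune
from ray.tune.utils.mock import FailureInjectorCallback  # per the traceback

def train_fn(config):
    # Dummy training loop standing in for the PyTorch/PBT workload.
    for i in range(100):
        tune.report(mean_accuracy=i / 100)

ray.init(address="auto")  # assumes an already-running multi-node cluster
tune.run(
    train_fn,
    num_samples=4,
    callbacks=[FailureInjectorCallback()],  # kills a random node on step begin
    stop={"training_iteration": 1},  # the smoke-test stop seen at line 136
)
```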
@amogkam @ijrsvt I wonder if we're killing the same node in rapid succession (before the docker image is run the second time).
ok, sounds good
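If that hypothesis holds, one possible mitigation (purely illustrative; none of these names are Ray APIs) would be a cooldown in the failure injector, so a node is not targeted again before its container has had time to come back up:

```python
import time

class CooldownFailureInjector:
    """Illustrative variant of a failure-injecting callback that waits
    between kills, so a node is not hit again while its ray_container
    is still restarting. Not actual Ray code."""

    def __init__(self, cooldown_s: float = 120.0):
        self._cooldown_s = cooldown_s
        self._last_kill = float("-inf")

    def on_step_begin(self, **info):
        now = time.monotonic()
        if now - self._last_kill < self._cooldown_s:
            return  # still inside the cooldown window; skip this round
        self._last_kill = now
        self._kill_random_node()

    def _kill_random_node(self):
        # Would call the autoscaler's kill_node, as mock.py does.
        raise NotImplementedError
```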