Long-running distributed test fails
What is the problem?
The long-running distributed release test (pytorch_pbt_failure) fails after around 10 minutes with the following error:
```
2021-02-04 17:58:07,590 INFO commands.py:283 -- Checking AWS environment settings
2021-02-04 17:58:08,874 INFO commands.py:431 -- A random node will be killed. Confirm [y/N]: y [automatic, due to --yes]
2021-02-04 17:58:09,027 INFO commands.py:441 -- Shutdown i-03aa1f3b86602ada0
2021-02-04 17:58:09,028 INFO command_runner.py:356 -- Fetched IP: 52.36.104.14
2021-02-04 17:58:09,028 INFO log_timer.py:27 -- NodeUpdater: i-03aa1f3b86602ada0: Got IP [LogTimer=0ms]
Warning: Permanently added '52.36.104.14' (ECDSA) to the list of known hosts.
Error: No such container: ray_container
Shared connection to 52.36.104.14 closed.
2021-02-04 17:59:20,400 WARNING util.py:152 -- The `callbacks.on_step_begin` operation took 72.837 s, which may be a performance bottleneck.
Traceback (most recent call last):
  File "/home/ray/pytorch_pbt_failure.py", line 136, in <module>
    stop={"training_iteration": 1} if args.smoke_test else None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 360, in step
    iteration=self._iteration, trials=self._trials)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/callback.py", line 172, in on_step_begin
    callback.on_step_begin(**info)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/mock.py", line 122, in on_step_begin
    override_cluster_name=None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 460, in kill_node
    _exec(updater, "ray stop", False, False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/commands.py", line 912, in _exec
    shutdown_after_run=shutdown_after_run)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 627, in run
    ssh_options_override_ssh_key=ssh_options_override_ssh_key)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 519, in run
    final_cmd, with_output, exit_on_fail, silent=silent)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/command_runner.py", line 445, in _run_helper
    "Command failed:\n\n {}\n".format(joined_cmd)) from None
click.exceptions.ClickException: Command failed:

ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_070dd72385/3d9ed41da7/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@52.36.104.14 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it ray_container /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'"'"'"'"'"'"'"'"''"'"' )'
```
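For context, the failing command ssh-es into the chosen node and runs `ray stop` inside the `ray_container` docker container; `Error: No such container: ray_container` is docker's response when no container by that name is running, e.g. because it was just killed and has not restarted yet. A minimal illustration of the failure mode and a hypothetical guard follows (the `container_running` helper and the skip-instead-of-fail behavior are assumptions for illustration, not Ray code):

```python
import subprocess

def container_running(name: str = "ray_container") -> bool:
    # `docker ps -q -f name=...` prints a container ID only if a matching
    # container is currently running; empty output means it is not.
    result = subprocess.run(
        ["docker", "ps", "-q", "-f", f"name={name}"],
        capture_output=True, text=True,
    )
    return bool(result.stdout.strip())

if container_running():
    # This is effectively what the autoscaler's command runner executes:
    subprocess.run(["docker", "exec", "ray_container", "ray", "stop"], check=True)
else:
    # Without a guard like this, `docker exec` exits non-zero with
    # "Error: No such container: ray_container", which the command runner
    # surfaces as the ClickException in the traceback above.
    print("ray_container not running; skipping 'ray stop'")
```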
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
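A minimal sketch of the failing pattern, based on the traceback above: `tune.run` driven by the failure-injecting callback from `ray/tune/utils/mock.py` (assumed here to be `FailureInjectorCallback`). The trainable is a placeholder; the real test trains PyTorch models under PBT on a multi-node cluster, so this is not self-contained:

```python
import ray
from ray import tune
from ray.tune.utils.mock import FailureInjectorCallback  # per the traceback

def train_fn(config):
    # Dummy training loop standing in for the PyTorch/PBT workload.
    for i in range(100):
        tune.report(mean_accuracy=i / 100)

ray.init(address="auto")  # assumes an already-running multi-node cluster
tune.run(
    train_fn,
    num_samples=4,
    callbacks=[FailureInjectorCallback()],  # kills a random node on step begin
    stop={"training_iteration": 1},  # the smoke-test stop seen at line 136
)
```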
@amogkam @ijrsvt I wonder if we're killing the same node in rapid succession (before the docker image is run the second time).
ok, sounds good
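If that hypothesis holds, one possible mitigation (purely illustrative; none of these names are Ray APIs) would be a cooldown in the failure injector, so a node is not targeted again before its container has had time to come back up:

```python
import time

class CooldownFailureInjector:
    """Illustrative variant of a failure-injecting callback that waits
    between kills, so a node is not hit again while its ray_container
    is still restarting. Not actual Ray code."""

    def __init__(self, cooldown_s: float = 120.0):
        self._cooldown_s = cooldown_s
        self._last_kill = float("-inf")

    def on_step_begin(self, **info):
        now = time.monotonic()
        if now - self._last_kill < self._cooldown_s:
            return  # still inside the cooldown window; skip this round
        self._last_kill = now
        self._kill_random_node()

    def _kill_random_node(self):
        # Would call the autoscaler's kill_node, as mock.py does.
        raise NotImplementedError
```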