Random but consistent crashes with KeyError: 'CPU'
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Ray installed from (source or binary): Installed from source
- Ray version: 0.5.3, built from commit d10cb570ab735a5f67f027e784792b6094d514fc
- Python version: 3.6.6
- Exact command to reproduce: Unfortunately I can’t seem to make a reproducible test case.
Describe the problem
I’ve been consistently getting the following error after tune runs for some time. It occurs both when running with a single head node and when running with one head node plus multiple worker nodes. Inspecting the logs doesn’t provide any clue as to what is happening.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 6/8 CPUs, 0/1 GPUs
Memory usage on this node: 17.8/50.5 GB
Result logdir: /home/mvdoc/ray_results/test
PENDING trials:
- tikhonov_lvl1_205_n=116.0,m=15.0,m=75.0: PENDING
- tikhonov_lvl1_206_n=14.0,m=184.0,m=40.0: PENDING
- tikhonov_lvl1_207_n=240.0,m=586.0,m=111.0: PENDING
- tikhonov_lvl1_208_n=1085.0,m=7.0,m=982.0: PENDING
... 10 not shown
- tikhonov_lvl1_219_n=147.0,m=403.0,m=430.0: PENDING
- tikhonov_lvl1_220_n=23.0,m=55.0,m=132.0: PENDING
- tikhonov_lvl1_221_n=390.0,m=241.0,m=155.0: PENDING
- tikhonov_lvl1_222_n=94.0,m=34.0,m=106.0: PENDING
RUNNING trials:
- tikhonov_lvl1_202_n=671.0,m=64.0,m=14.0: RUNNING
- tikhonov_lvl1_203_n=55.0,m=122.0,m=3.0: RUNNING
- tikhonov_lvl1_204_n=286.0,m=19.0,m=62.0: RUNNING
TERMINATED trials:
- tikhonov_lvl1_1_n=70.0,m=924.0,m=530.0: TERMINATED [pid=28261], 295 s, 1 iter
- tikhonov_lvl1_2_n=6.0,m=1.0,m=112.0: TERMINATED [pid=28263], 299 s, 1 iter
- tikhonov_lvl1_3_n=9.0,m=898.0,m=3.0: TERMINATED [pid=28259], 296 s, 1 iter
- tikhonov_lvl1_4_n=445.0,m=21.0,m=586.0: TERMINATED [pid=28262], 292 s, 1 iter
... 193 not shown
- tikhonov_lvl1_198_n=536.0,m=432.0,m=177.0: TERMINATED [pid=7470], 193 s, 1 iter
- tikhonov_lvl1_199_n=3.0,m=87.0,m=286.0: TERMINATED [pid=7473], 163 s, 1 iter
- tikhonov_lvl1_200_n=135.0,m=37.0,m=21.0: TERMINATED [pid=7550], 72 s, 1 iter
- tikhonov_lvl1_201_n=75.0,m=101.0,m=245.0: TERMINATED [pid=7553], 203 s, 1 iter
Traceback (most recent call last):
  File "compute_scores_lvl1.py", line 258, in <module>
    main(exp_config)
  File "compute_scores_lvl1.py", line 241, in main
    queue_trials=True)
  File "/opt/src/ray/python/ray/tune/tune.py", line 108, in run_experiments
    runner.step()
  File "/opt/src/ray/python/ray/tune/trial_runner.py", line 113, in step
    self.trial_executor.on_step_begin()
  File "/opt/src/ray/python/ray/tune/ray_trial_executor.py", line 276, in on_step_begin
    self._update_avail_resources()
  File "/opt/src/ray/python/ray/tune/ray_trial_executor.py", line 219, in _update_avail_resources
    num_cpus = resources["CPU"]
KeyError: 'CPU'
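The failing line indexes the cluster resource dict directly, so the crash means the resource table reported via Redis momentarily contained no CPU entry at all, which points at a node-level process dying or not re-registering rather than at the Tune config itself. Purely as an illustration (this is not the actual Ray 0.5.3 source), a more defensive lookup using the same global-state call that the test script further down relies on could look like:

# Illustrative sketch only, not the actual Ray 0.5.3 implementation of
# _update_avail_resources. Assumes ray.init() has already been called and
# that an empty or partial resource table is a transient condition.
import time

import ray


def get_avail_resources(retries=3, wait_s=0.5):
    for _ in range(retries):
        resources = ray.global_state.cluster_resources()
        if "CPU" in resources:
            return resources["CPU"], resources.get("GPU", 0)
        # No CPU entry at all: a scheduler may have died or not yet
        # re-registered with Redis, so wait briefly and retry.
        time.sleep(wait_s)
    raise RuntimeError("Cluster reports no CPU resources; a node likely died.")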
Source code / logs
The head node is initialized as follows:
#!/bin/bash
export MKL_NUM_THREADS=2
export OMP_NUM_THREADS="$MKL_NUM_THREADS"
ray start --head --no-ui --redis-port 6379 --num-cpus 8
Tune is called as follows, where Tikhonov is a subclass of Trainable:
register_trainable("tikhonov_lvl1", Tikhonov)

config = {
    'test': {
        "run": "tikhonov_lvl1",
        "num_samples": 5000,
        "trial_resources": {
            "cpu": 2,
            "gpu": 0
        },
        "stop": {
            "finished": True
        },
        "checkpoint_freq": 1,
        "max_failures": 10,
    }
}

feature_samplers = {fn: hp.qloguniform(fn, 0, 7, 1) for fn in feature_names}
njobs = 22
algo = HyperOptSearch(feature_samplers,
                      max_concurrent=njobs,
                      reward_attr="performance")

run_experiments(config,
                search_alg=algo,
                with_server=True,
                queue_trials=True)
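For context, the stop criterion {"finished": True} is satisfied by the trainable reporting finished=True in its result dict, which is why every TERMINATED trial above shows exactly 1 iter. The skeleton below is a hypothetical stand-in for the reporter's Tikhonov class, not their actual code; the hook names follow the pre-1.0 Trainable API as used in this Ray version, and the real class would also implement the checkpoint hooks required by checkpoint_freq=1:

# Hypothetical stand-in for the reporter's Tikhonov class (0.5.x-era
# Trainable API; adjust the import path if this version exports it elsewhere).
from ray.tune import Trainable


class TikhonovSketch(Trainable):
    def _setup(self):
        # Sampled hyperparameters arrive via self.config (set by HyperOpt).
        self.params = self.config

    def _train(self):
        performance = 0.0  # placeholder for the real Tikhonov regression fit
        # Reporting finished=True matches the {"finished": True} stop
        # criterion, so the trial terminates after this single iteration;
        # "performance" is the key HyperOptSearch uses as reward_attr.
        return {"performance": performance, "finished": True}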
After the crash, it seems that some of the processes are dead (but I don’t know which ones). Running the following script returns warnings:
# test_ray.py
import ray
import sys
import os
import multiprocessing
import time

ray.init(redis_address="XXXX:6379")


@ray.remote
def f():
    time.sleep(0.01)
    return (ray.services.get_node_ip_address(),
            os.environ['MKL_NUM_THREADS'],
            multiprocessing.cpu_count())


# Get a list of the IP addresses of the nodes that have joined the cluster.
machines = set(ray.get([f.remote() for _ in range(100)]))

print("IP MKL MAXCORES")
for m in machines:
    print(m)

print(ray.global_state.cluster_resources())
returns
$ python test_ray.py
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
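That warning usually indicates that one or more per-node processes (e.g. the plasma store or local scheduler) died and never re-registered, which would also leave the resource table without a CPU entry and trigger the KeyError above. One hedged way to inspect this from the driver is to dump the global state with the same 0.5.x API the script above uses (the exact structure of the client table entries is version-dependent, so this only prints them for inspection):

# Diagnostic sketch using the 0.5.x global-state API; "XXXX" is the same
# placeholder Redis address used in test_ray.py above.
import ray

ray.init(redis_address="XXXX:6379")

# Per-node client records: look for processes that are missing or marked dead
# compared to the expected cluster membership.
print(ray.global_state.client_table())

# If a node's scheduler has dropped out, CPU may be missing here as well.
print(ray.global_state.cluster_resources())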
Top GitHub Comments
Hi @richardliaw, unfortunately I can’t respond satisfactorily to your questions – I have since restarted the experiment. I will add more information to this issue as soon as I get the same error. As a ballpark, the last time it happened it crashed after 200 trials (each lasting about 3 mins).
Very glad to see #3309; that is an exciting feature. Thank you for the great work; I'm really enjoying ray and tune.
OK I’ll close this for now, but @mvdoc if this arises again, please let me know.