
Random but consistent crashes with KeyError: 'CPU'


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): Installed from source
  • Ray version: 0.5.3, built from commit d10cb570ab735a5f67f027e784792b6094d514fc
  • Python version: 3.6.6
  • Exact command to reproduce: Unfortunately I can’t seem to make a reproducible test case.

Describe the problem

I’ve been consistently getting the following error after Tune has been running for some time. It occurs both when running with a single head node and when running with one head node plus several worker nodes. Inspecting the logs doesn’t provide any clue about what is happening.

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 6/8 CPUs, 0/1 GPUs
Memory usage on this node: 17.8/50.5 GB
Result logdir: /home/mvdoc/ray_results/test
PENDING trials:
 - tikhonov_lvl1_205_n=116.0,m=15.0,m=75.0:     PENDING
 - tikhonov_lvl1_206_n=14.0,m=184.0,m=40.0:     PENDING
 - tikhonov_lvl1_207_n=240.0,m=586.0,m=111.0:   PENDING
 - tikhonov_lvl1_208_n=1085.0,m=7.0,m=982.0:    PENDING
  ... 10 not shown
 - tikhonov_lvl1_219_n=147.0,m=403.0,m=430.0:   PENDING
 - tikhonov_lvl1_220_n=23.0,m=55.0,m=132.0:     PENDING
 - tikhonov_lvl1_221_n=390.0,m=241.0,m=155.0:   PENDING
 - tikhonov_lvl1_222_n=94.0,m=34.0,m=106.0:     PENDING
RUNNING trials:
 - tikhonov_lvl1_202_n=671.0,m=64.0,m=14.0:     RUNNING
 - tikhonov_lvl1_203_n=55.0,m=122.0,m=3.0:      RUNNING
 - tikhonov_lvl1_204_n=286.0,m=19.0,m=62.0:     RUNNING
TERMINATED trials:
 - tikhonov_lvl1_1_n=70.0,m=924.0,m=530.0:      TERMINATED [pid=28261], 295 s, 1 iter
 - tikhonov_lvl1_2_n=6.0,m=1.0,m=112.0: TERMINATED [pid=28263], 299 s, 1 iter
 - tikhonov_lvl1_3_n=9.0,m=898.0,m=3.0: TERMINATED [pid=28259], 296 s, 1 iter
 - tikhonov_lvl1_4_n=445.0,m=21.0,m=586.0:      TERMINATED [pid=28262], 292 s, 1 iter
  ... 193 not shown
 - tikhonov_lvl1_198_n=536.0,m=432.0,m=177.0:   TERMINATED [pid=7470], 193 s, 1 iter
 - tikhonov_lvl1_199_n=3.0,m=87.0,m=286.0:      TERMINATED [pid=7473], 163 s, 1 iter
 - tikhonov_lvl1_200_n=135.0,m=37.0,m=21.0:     TERMINATED [pid=7550], 72 s, 1 iter
 - tikhonov_lvl1_201_n=75.0,m=101.0,m=245.0:    TERMINATED [pid=7553], 203 s, 1 iter

Traceback (most recent call last):
  File "compute_scores_lvl1.py", line 258, in <module>
    main(exp_config)
  File "compute_scores_lvl1.py", line 241, in main
    queue_trials=True)
  File "/opt/src/ray/python/ray/tune/tune.py", line 108, in run_experiments
    runner.step()
  File "/opt/src/ray/python/ray/tune/trial_runner.py", line 113, in step
    self.trial_executor.on_step_begin()
  File "/opt/src/ray/python/ray/tune/ray_trial_executor.py", line 276, in on_step_begin
    self._update_avail_resources()
  File "/opt/src/ray/python/ray/tune/ray_trial_executor.py", line 219, in _update_avail_resources
    num_cpus = resources["CPU"]
KeyError: 'CPU'
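
For reference, the failing line in ray_trial_executor._update_avail_resources indexes the aggregated resource dictionary directly, so if the cluster state momentarily reports no 'CPU' entry at all (for example because a node-level scheduler died and its resources were dropped from the aggregate), the lookup raises rather than degrading gracefully. A minimal sketch of a defensive lookup, purely as a hypothetical illustration and not the actual change made in Ray:

# Hypothetical guard, for illustration only; `resources` stands for the dict
# built in _update_avail_resources from the cluster state.
def safe_cpu_gpu(resources):
    # Fall back to 0 instead of raising KeyError: 'CPU' when a node's
    # resources are missing from the aggregated view.
    num_cpus = resources.get("CPU", 0)
    num_gpus = resources.get("GPU", 0)
    return num_cpus, num_gpus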

Source code / logs

The head node is initialized as follows:

#!/bin/bash
export MKL_NUM_THREADS=2
export OMP_NUM_THREADS="$MKL_NUM_THREADS"

ray start --head --no-ui --redis-port 6379 --num-cpus 8
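
As a sanity check right after ray start, a tiny driver script can confirm that the 8 CPUs were actually registered. This uses the same calls as the test script further down and is only meant as an illustrative check for this Ray 0.5.x setup:

# check_resources.py: confirm the head node advertised its CPUs (illustrative check)
import ray

ray.init(redis_address="XXXX:6379")  # redis address redacted, as in the scripts below
print(ray.global_state.cluster_resources())  # should include at least {'CPU': 8.0}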

Tune is called as follows, where Tikhonov is a subclass of Trainable:

# Imports assumed for this excerpt (not shown in the original issue, which is
# taken from compute_scores_lvl1.py; exact module paths may differ across Ray versions).
from hyperopt import hp
from ray.tune import register_trainable, run_experiments
from ray.tune.suggest import HyperOptSearch

register_trainable("tikhonov_lvl1", Tikhonov)
config = {
    'test': {
        "run": "tikhonov_lvl1",
        "num_samples": 5000,
        "trial_resources": {
            "cpu": 2,
            "gpu": 0
        },
        "stop": {
            "finished": True
        },
        "checkpoint_freq": 1,
        "max_failures": 10,
    }
}
# One hyperopt sampler per feature; feature_names is defined elsewhere in the script.
feature_samplers = {fn: hp.qloguniform(fn, 0, 7, 1) for fn in feature_names}
njobs = 22
algo = HyperOptSearch(feature_samplers,
                      max_concurrent=njobs,
                      reward_attr="performance")
run_experiments(config,
                search_alg=algo,
                with_server=True,
                queue_trials=True)
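
A quick back-of-the-envelope check on these settings: with trial_resources asking for 2 CPUs per trial and the head node started with --num-cpus 8, only about 4 trials can run on the head node at once, regardless of max_concurrent=22; the status snapshot above (3 RUNNING trials, 6/8 CPUs requested) is consistent with that. The arithmetic, spelled out as a small illustrative snippet:

# Illustrative arithmetic only; numbers are taken from the config and launch script above.
cpus_on_head = 8        # from `ray start ... --num-cpus 8`
cpus_per_trial = 2      # from trial_resources {"cpu": 2, "gpu": 0}
max_concurrent = 22     # from HyperOptSearch(max_concurrent=njobs)

concurrent_trials = min(max_concurrent, cpus_on_head // cpus_per_trial)
print(concurrent_trials)  # -> 4 trials at a time on the head node alone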

After the crash, it seems that some of the Ray processes are dead (but I can’t tell which ones). Running the following script produces warnings:

# test_ray.py
import ray
import sys
import os
import multiprocessing
import time

ray.init(redis_address="XXXX:6379")

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address(), os.environ['MKL_NUM_THREADS'], multiprocessing.cpu_count()

# Collect (node IP, MKL_NUM_THREADS, cpu_count) tuples from tasks spread across the cluster.
machines = set(ray.get([f.remote() for _ in range(100)]))
print("IP   MKL      MAXCORES")
for m in machines:
    print(m)
print(ray.global_state.cluster_resources())

returns

$ python test_ray.py
Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
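
Since that warning suggests some node-level Ray process never re-registered with Redis, one crude diagnostic (assuming ray.init eventually connects) is to poll the cluster resources from a driver and see whether a 'CPU' entry ever reappears. This is only an illustrative probe of the symptom, not a fix:

# poll_resources.py: illustrative probe, not a fix
import time
import ray

ray.init(redis_address="XXXX:6379")

for _ in range(30):
    resources = ray.global_state.cluster_resources()
    print(resources)
    if resources.get("CPU", 0) > 0:
        break              # the cluster is advertising CPUs again
    time.sleep(10)         # otherwise the node-level processes are likely gone
else:
    print("No 'CPU' entry after ~5 minutes; the dead processes probably need a restart.")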

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

1 reaction
mvdoc commented, Nov 16, 2018

Hi @richardliaw, unfortunately I can’t respond satisfactorily to your questions – I have since restarted the experiment. I will add more information to this issue as soon as I get the same error. As a ballpark, the last time it happened it crashed after 200 trials (each lasting about 3 mins).

Very glad to see #3309, that is an exciting feature. Thank you for the great work, really enjoying ray and tune.

0 reactions
richardliaw commented, Dec 11, 2018

OK I’ll close this for now, but @mvdoc if this arises again, please let me know.
