
RayGetError after evicting objects

See original GitHub issue

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Ray installed from (source or binary): binary
  • Ray version: 9dd3eedbac31d93cc32e9e87d03e8d8da1507fa6
  • Python version: 3.6.5

Describe the problem

Trials fail due to RayGetError. This is not the same as #3170; in that issue, trials failed after running for a while, and that problem has already been solved.

Here, it seems that trials raise RayGetError when the backend starts to evict objects to free memory. My guess is that some objects that are still in use are evicted unexpectedly.

I’ll try to find a simple setting to reproduce this.
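
For reference, here is a minimal sketch of the failure mode I have in mind (hypothetical, not my actual setup, and the exact behavior depends on the Ray version): start Ray with a deliberately small object store, keep putting objects until earlier ones are evicted, and then try to get one of them.

# Hypothetical sketch only -- not the actual reproduction. On the 2018-era Ray
# used here, fetching an evicted object that cannot be reconstructed can fail
# (as in the RayGetError above) or hang; newer Ray versions pin driver-referenced
# objects and may block or spill to disk instead.
import numpy as np
import ray

# Small object store so that eviction kicks in quickly.
ray.init(object_store_memory=200 * 1024 * 1024)

first = ray.put(np.zeros(10 * 1024 * 1024, dtype=np.uint8))  # ~10 MB object

# Fill the store with enough additional objects to push the first one out.
fillers = [ray.put(np.zeros(10 * 1024 * 1024, dtype=np.uint8)) for _ in range(50)]

# If `first` was evicted and cannot be restored, this get fails.
print(ray.get(first).nbytes)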

Source code / logs

Remote function train failed with:

Traceback (most recent call last):
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 845, in _process_task
    *arguments)
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/llan/Workspaces/morrl/maml/maml.py", line 161, in train
    return Agent.__base__.train(self)
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/home/llan/Workspaces/morrl/maml/maml.py", line 152, in _train
    fetches = self.optimizer.step()
  File "/home/llan/Workspaces/morrl/maml/maml_optimizer.py", line 39, in step
    for e in self.remote_evaluators])
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2349, in get
    raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000f006fc77673052333588622ecb0ec8c7). It was created by remote function <unknown> which failed with:

Remote function <unknown> failed with:

Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.

Error processing event.
Traceback (most recent call last):
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 240, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 200, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2357, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000637ea2c29a8df2a4a1c969c52a413a2a). It was created by remote function train which failed with:

Remote function train failed with:

Traceback (most recent call last):
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 845, in _process_task
    *arguments)
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/llan/Workspaces/morrl/maml/maml.py", line 161, in train
    return Agent.__base__.train(self)
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/home/llan/Workspaces/morrl/maml/maml.py", line 152, in _train
    fetches = self.optimizer.step()
  File "/home/llan/Workspaces/morrl/maml/maml_optimizer.py", line 39, in step
    for e in self.remote_evaluators])
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2349, in get
    raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000f006fc77673052333588622ecb0ec8c7). It was created by remote function <unknown> which failed with:

Remote function <unknown> failed with:

Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.

Log sync requires cluster to be setup with `ray create_or_update`.
== Status ==
Using AsyncHyperBand: num_stopped=140
Bracket: Iter 90.000: -7.03951331596347 | Iter 30.000: -9.99005911512262 | Iter 10.000: -28.57305023532921
Bracket: Iter 90.000: -5.553955853459232 | Iter 30.000: -10.7469617836115
Bracket: Iter 90.000: -7.220892821492349
Resources requested: 189/240 CPUs, 0/4 GPUs
Result logdir: /ray_results/MAML_PointEnv
ERROR trials:
 - MAML_PointEnv_148_m=0.3,p=20.0,r=0.1,r=10,r=0.0003,d=1:	ERROR, 1 failures: /ray_results/MAML_PointEnv/MAML_PointEnv_148_m=0.3,p=20.0,r=0.1,r=10,r=0.0003,d=1_2018-11-12_00-45-164j_tg9_0/error_2018-11-12_00-59-01.txt [gpu-server-00 pid=202606], 788 s, 53 iter
RUNNING trials:
 - MAML_PointEnv_147_m=0.3,p=30.0,r=0.1,r=10,r=0.0003,d=3:	RUNNING [pid=6541], 905 s, 56 iter
 - MAML_PointEnv_153_m=0.2,p=30.0,r=0.1,r=10,r=0.0003,d=5:	RUNNING [pid=17659], 521 s, 28 iter
 - MAML_PointEnv_154_m=0.2,p=20.0,r=0.1,r=10,r=0.0003,d=5:	RUNNING [gpu-server-00 pid=203243], 501 s, 29 iter
 - MAML_PointEnv_156_m=0.2,p=40.0,r=0.1,r=10,r=0.0003,d=5:	RUNNING [gpu-server-01 pid=88044], 445 s, 27 iter
 - MAML_PointEnv_162_m=0.3,p=20.0,r=0.1,r=10,r=0.0003,d=2:	RUNNING [gpu-server-00 pid=207668], 157 s, 8 iter
 - MAML_PointEnv_163_m=0.15,p=30.0,r=0.01,r=10,r=0.0003,d=5:	RUNNING [gpu-server-00 pid=207663], 144 s, 7 iter
 - MAML_PointEnv_164_m=0.3,p=20.0,r=0.01,r=10,r=0.0003,d=5:	RUNNING [gpu-server-01 pid=89697], 107 s, 6 iter
 - MAML_PointEnv_165_m=0.2,p=40.0,r=0.01,r=10,r=0.0003,d=5:	RUNNING
 - MAML_PointEnv_166_m=0.3,p=40.0,r=0.01,r=10,r=0.0003,d=5:	RUNNING
TERMINATED trials:
 - MAML_PointEnv_1_m=0.2,p=30.0,r=0.1,r=5,r=0.001,d=3:	TERMINATED [gpu-server-01 pid=28344], 1657 s, 100 iter
 - MAML_PointEnv_2_m=0.1,p=40.0,r=0.1,r=5,r=0.0003,d=5:	TERMINATED [gpu-server-01 pid=28340], 373 s, 30 iter
 - MAML_PointEnv_3_m=0.1,p=20.0,r=0.1,r=5,r=0.001,d=3:	TERMINATED [pid=16969], 1528 s, 90 iter
 - MAML_PointEnv_4_m=0.1,p=30.0,r=0.01,r=5,r=0.0003,d=5:	TERMINATED [gpu-server-00 pid=150499], 657 s, 30 iter
 - MAML_PointEnv_5_m=0.2,p=30.0,r=0.05,r=10,r=0.0003,d=5:	TERMINATED [pid=14003], 1909 s, 100 iter
  ... 146 not shown
 - MAML_PointEnv_157_m=0.3,p=20.0,r=0.01,r=10,r=0.0003,d=5:	TERMINATED [gpu-server-01 pid=88038], 197 s, 10 iter
 - MAML_PointEnv_158_m=0.15,p=40.0,r=0.01,r=10,r=0.0003,d=5:	TERMINATED [pid=22457], 195 s, 10 iter
 - MAML_PointEnv_159_m=0.3,p=30.0,r=0.1,r=10,r=0.0003,d=2:	TERMINATED [pid=22230], 195 s, 10 iter
 - MAML_PointEnv_160_m=0.3,p=20.0,r=0.01,r=10,r=0.0003,d=5:	TERMINATED [pid=12556], 175 s, 10 iter
 - MAML_PointEnv_161_m=0.3,p=40.0,r=0.01,r=10,r=0.0003,d=5:	TERMINATED [pid=13458], 169 s, 10 iter

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
llan-ml commented, Nov 23, 2018

Tested using my original code, and no errors are raised now.

0 reactions
ericl commented, Nov 19, 2018
# -*- coding: utf-8 -*-
# @Author  : Lin Lan (ryan.linlan@gmail.com)

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import time
import numpy as np
import ray
from ray.tune.trainable import Trainable
from ray.tune.trial import Resources
from ray.tune import register_trainable, run_experiments


@ray.remote(num_cpus=1)
class ParameterServer(object):
    def __init__(self):
        self.weights = np.random.rand(128, 128).astype(np.float64)

    def get(self):
        return self.weights

    def update(self, diff):
        self.weights += diff


@ray.remote(num_cpus=1)
class Worker(object):
    def __init__(self, seed_holder):
        self.weights = None
        self.seed_holder = seed_holder

    def set_weights(self, weights):
        self.weights = weights

    def calculate_diff(self):
        # Fetch 100 seed RandomState objects from the shared SeedHolder actor,
        # pick one at random, and generate a diff with the same shape as the
        # weights.
        seeds = ray.get(
            [self.seed_holder.get.remote() for _ in range(100)])
        rng = np.random.choice(seeds)
        return rng.rand(*self.weights.shape)


@ray.remote(num_cpus=1)
class SeedHolder(object):
    def __init__(self):
        self.seeds = [
            np.random.RandomState(seed) for seed in range(10)]

    def get(self):
        return np.random.choice(self.seeds)


class Foo(Trainable):
    @classmethod
    def default_resource_request(cls, config):
        return Resources(
            cpu=1 + 1,
            gpu=0,
            extra_cpu=20,
            extra_gpu=0)

    def _setup(self, config):
        self.seed_holder = SeedHolder.remote()
        self.ps = ParameterServer.remote()
        self.workers = [
            Worker.remote(self.seed_holder) for _ in range(20)]

    def _train(self):
        # Broadcast the current weights to all workers with a single ray.put,
        # then gather one diff per worker and apply the averaged diff below.
        weights = ray.get(self.ps.get.remote())
        weights_id = ray.put(weights)
        ray.get([w.set_weights.remote(weights_id)
                 for w in self.workers])

        all_diffs = ray.get(
            [e.calculate_diff.remote() for e in self.workers])

        diff = np.mean(all_diffs, axis=0)
        self.ps.update.remote(diff)
        weights = ray.get(self.ps.get.remote())
        return {"weight_norm": np.linalg.norm(weights)}



register_trainable("foo", Foo)

from ray.test.cluster_utils import Cluster

# Simulate a 4-node cluster on a single machine: 64 CPUs per node and a
# 2 GB object store per node.
cluster = Cluster()
for _ in range(4):
    cluster.add_node(
        resources={
            "num_cpus": 64,
            "num_gpus": 0,
        },
        object_store_memory=2000000000)

ray.init(redis_address=cluster.redis_address)
run_experiments(
    {
        "test": {
            "run": "foo",
            "stop": {"training_iteration": 500},
            "num_samples": 1000,
            "local_dir": "/tmp/ray_results"
        }
    }
)

By the way, this is my script; it runs on a single machine with 4 local schedulers (the 4 nodes added via Cluster above).
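
If you want more or less eviction pressure when reproducing, the knob is the per-node object_store_memory passed to Cluster.add_node above; for example (the value here is illustrative only):

# Illustrative only: a larger per-node object store delays eviction,
# a smaller one makes it happen sooner.
cluster.add_node(
    resources={
        "num_cpus": 64,
        "num_gpus": 0,
    },
    object_store_memory=8000000000)  # e.g. 8 GB per node instead of 2 GB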
