RayGetError after evicting objects
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- Ray installed from (source or binary): binary
- Ray version: 9dd3eedbac31d93cc32e9e87d03e8d8da1507fa6
- Python version: 3.6.5
Describe the problem
Trials fail due to RayGetError. This is not the same as #3170. In that issue, trials failed after running for a while, and that has already been solved.
In this case, it seems that trials raise RayGetError when the backend starts to evict objects to free memory. I guess some objects that are still in use are evicted unexpectedly.
I’ll try to find a simple setting to reproduce this.
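Until a real reproduction is available, the suspected failure mode can be illustrated with a toy model. The sketch below is not Ray code and does not use any Ray API: `ToyObjectStore` and the object names are hypothetical, and it assumes a store that evicts the least recently used entry without tracking whether a caller still holds the ID.

```python
from collections import OrderedDict


class ToyObjectStore:
    """Hypothetical fixed-capacity store with LRU eviction (not Ray's API)."""

    def __init__(self, capacity):
        self.capacity = capacity           # maximum number of stored objects
        self._objects = OrderedDict()      # object_id -> value, oldest first

    def put(self, object_id, value):
        # Evict least recently used entries until there is room.
        # Note: nothing checks whether a caller still holds the evicted ID.
        while len(self._objects) >= self.capacity:
            self._objects.popitem(last=False)
        self._objects[object_id] = value

    def get(self, object_id):
        if object_id not in self._objects:
            raise KeyError(f"object {object_id} was evicted")
        self._objects.move_to_end(object_id)  # mark as recently used
        return self._objects[object_id]


store = ToyObjectStore(capacity=2)
store.put("weights", [0.1, 0.2])    # the caller keeps this ID for later use
store.put("batch_1", "rollouts")
store.put("batch_2", "rollouts")    # store is full, so "weights" is evicted

try:
    store.get("weights")
    evicted = False
except KeyError:
    evicted = True                  # the still-needed object is gone
```

In this model the later `get` fails even though the caller never released the ID, which is the pattern the report above suspects.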
Source code / logs
Remote function train failed with:

Traceback (most recent call last):
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 845, in _process_task
    *arguments)
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/llan/Workspaces/morrl/maml/maml.py", line 161, in train
    return Agent.__base__.train(self)
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/home/llan/Workspaces/morrl/maml/maml.py", line 152, in _train
    fetches = self.optimizer.step()
  File "/home/llan/Workspaces/morrl/maml/maml_optimizer.py", line 39, in step
    for e in self.remote_evaluators])
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2349, in get
    raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000f006fc77673052333588622ecb0ec8c7). It was created by remote function <unknown> which failed with:

Remote function <unknown> failed with:

Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.

Error processing event.
Traceback (most recent call last):
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 240, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 200, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2357, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000637ea2c29a8df2a4a1c969c52a413a2a). It was created by remote function train which failed with:

Remote function train failed with:

Traceback (most recent call last):
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 845, in _process_task
    *arguments)
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/llan/Workspaces/morrl/maml/maml.py", line 161, in train
    return Agent.__base__.train(self)
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/home/llan/Workspaces/morrl/maml/maml.py", line 152, in _train
    fetches = self.optimizer.step()
  File "/home/llan/Workspaces/morrl/maml/maml_optimizer.py", line 39, in step
    for e in self.remote_evaluators])
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2349, in get
    raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000f006fc77673052333588622ecb0ec8c7). It was created by remote function <unknown> which failed with:

Remote function <unknown> failed with:

Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.

Log sync requires cluster to be setup with `ray create_or_update`.
== Status ==
Using AsyncHyperBand: num_stopped=140
Bracket: Iter 90.000: -7.03951331596347 | Iter 30.000: -9.99005911512262 | Iter 10.000: -28.57305023532921
Bracket: Iter 90.000: -5.553955853459232 | Iter 30.000: -10.7469617836115
Bracket: Iter 90.000: -7.220892821492349
Resources requested: 189/240 CPUs, 0/4 GPUs
Result logdir: /ray_results/MAML_PointEnv
ERROR trials:
 - MAML_PointEnv_148_m=0.3,p=20.0,r=0.1,r=10,r=0.0003,d=1: ERROR, 1 failures: /ray_results/MAML_PointEnv/MAML_PointEnv_148_m=0.3,p=20.0,r=0.1,r=10,r=0.0003,d=1_2018-11-12_00-45-164j_tg9_0/error_2018-11-12_00-59-01.txt [gpu-server-00 pid=202606], 788 s, 53 iter
RUNNING trials:
 - MAML_PointEnv_147_m=0.3,p=30.0,r=0.1,r=10,r=0.0003,d=3: RUNNING [pid=6541], 905 s, 56 iter
 - MAML_PointEnv_153_m=0.2,p=30.0,r=0.1,r=10,r=0.0003,d=5: RUNNING [pid=17659], 521 s, 28 iter
 - MAML_PointEnv_154_m=0.2,p=20.0,r=0.1,r=10,r=0.0003,d=5: RUNNING [gpu-server-00 pid=203243], 501 s, 29 iter
 - MAML_PointEnv_156_m=0.2,p=40.0,r=0.1,r=10,r=0.0003,d=5: RUNNING [gpu-server-01 pid=88044], 445 s, 27 iter
 - MAML_PointEnv_162_m=0.3,p=20.0,r=0.1,r=10,r=0.0003,d=2: RUNNING [gpu-server-00 pid=207668], 157 s, 8 iter
 - MAML_PointEnv_163_m=0.15,p=30.0,r=0.01,r=10,r=0.0003,d=5: RUNNING [gpu-server-00 pid=207663], 144 s, 7 iter
 - MAML_PointEnv_164_m=0.3,p=20.0,r=0.01,r=10,r=0.0003,d=5: RUNNING [gpu-server-01 pid=89697], 107 s, 6 iter
 - MAML_PointEnv_165_m=0.2,p=40.0,r=0.01,r=10,r=0.0003,d=5: RUNNING
 - MAML_PointEnv_166_m=0.3,p=40.0,r=0.01,r=10,r=0.0003,d=5: RUNNING
TERMINATED trials:
 - MAML_PointEnv_1_m=0.2,p=30.0,r=0.1,r=5,r=0.001,d=3: TERMINATED [gpu-server-01 pid=28344], 1657 s, 100 iter
 - MAML_PointEnv_2_m=0.1,p=40.0,r=0.1,r=5,r=0.0003,d=5: TERMINATED [gpu-server-01 pid=28340], 373 s, 30 iter
 - MAML_PointEnv_3_m=0.1,p=20.0,r=0.1,r=5,r=0.001,d=3: TERMINATED [pid=16969], 1528 s, 90 iter
 - MAML_PointEnv_4_m=0.1,p=30.0,r=0.01,r=5,r=0.0003,d=5: TERMINATED [gpu-server-00 pid=150499], 657 s, 30 iter
 - MAML_PointEnv_5_m=0.2,p=30.0,r=0.05,r=10,r=0.0003,d=5: TERMINATED [pid=14003], 1909 s, 100 iter
 ... 146 not shown
 - MAML_PointEnv_157_m=0.3,p=20.0,r=0.01,r=10,r=0.0003,d=5: TERMINATED [gpu-server-01 pid=88038], 197 s, 10 iter
 - MAML_PointEnv_158_m=0.15,p=40.0,r=0.01,r=10,r=0.0003,d=5: TERMINATED [pid=22457], 195 s, 10 iter
 - MAML_PointEnv_159_m=0.3,p=30.0,r=0.1,r=10,r=0.0003,d=2: TERMINATED [pid=22230], 195 s, 10 iter
 - MAML_PointEnv_160_m=0.3,p=20.0,r=0.01,r=10,r=0.0003,d=5: TERMINATED [pid=12556], 175 s, 10 iter
 - MAML_PointEnv_161_m=0.3,p=40.0,r=0.01,r=10,r=0.0003,d=5: TERMINATED [pid=13458], 169 s, 10 iter
Issue Analytics
- State:
- Created 5 years ago
- Comments:8 (8 by maintainers)
Top GitHub Comments
Tested with my original code, and no errors are raised now.
Btw this is my script that runs on one node with 4 schedulers.