RuntimeError: received 0 items of ancdata with custom gym
I'm getting a RuntimeError when I try to run several custom gyms in parallel.
```
Traceback (most recent call last):
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/run.py", line 270, in <module>
    main()
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/run.py", line 244, in main
    model, env = train(args, extra_args)
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/run.py", line 88, in train
    **alg_kwargs
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/ppo2/ppo2.py", line 329, in learn
    obs, returns, masks, actions, values, neglogpacs, states, epinfos = runner.run() #pylint: disable=E0632
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/ppo2/ppo2.py", line 178, in run
    self.obs[:], rewards, self.dones, infos = self.env.step(actions)
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/common/vec_env/__init__.py", line 100, in step
    return self.step_wait()
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/common/vec_env/vec_normalize.py", line 23, in step_wait
    obs, rews, news, infos = self.venv.step_wait()
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/common/vec_env/subproc_vec_env.py", line 70, in step_wait
    results = [remote.recv() for remote in self.remotes]
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/common/vec_env/subproc_vec_env.py", line 70, in <listcomp>
    results = [remote.recv() for remote in self.remotes]
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata
```
I have no clue how to debug this. Do I need to add any special functionality to my gym to support parallel execution?
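For context, here is roughly what is happening at the point where the traceback fails: baselines' SubprocVecEnv runs each environment in its own process and ships every step() result back to the parent through a multiprocessing pipe, so everything the env returns gets pickled on one side and unpickled on the other (the remote.recv() call in the traceback). The sketch below is a simplified stand-in, not the actual baselines code; DummyEnv, worker and the command protocol are illustrative only.

```python
import multiprocessing as mp

class DummyEnv:
    """Stand-in for a custom gym env; illustrative only."""
    def step(self, action):
        obs, reward, done, info = [0.0], 1.0, False, {"agent": "agent-0"}
        return obs, reward, done, info

def worker(remote, env_fn):
    # Each worker owns one env instance and answers step requests over the pipe.
    env = env_fn()
    while True:
        cmd, data = remote.recv()
        if cmd == "step":
            # The whole (obs, reward, done, info) tuple is pickled and pushed
            # through the pipe, so every element must be cheaply serialisable.
            remote.send(env.step(data))
        elif cmd == "close":
            remote.close()
            break

if __name__ == "__main__":
    parent, child = mp.Pipe()
    proc = mp.Process(target=worker, args=(child, DummyEnv))
    proc.start()
    parent.send(("step", 0))                   # roughly step_async()
    obs, reward, done, info = parent.recv()    # roughly step_wait()
    parent.send(("close", None))
    proc.join()
```

If anything in that returned tuple drags open file descriptors or other heavyweight resources through the pipe, the failure surfaces on the receiving side, which is consistent with the traceback above dying inside remote.recv().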
Top GitHub Comments
Finally debugged this crap. It's actually quite embarrassing to type here what the cause was… but I'll do it for the sake of rigour.

My gym's `step(action)` function was returning:

- `observation`: a NumPy multidimensional array;
- `reward`: a scalar;
- `done`: a boolean;
- `info`: the whole agent (for debugging purposes).

What I didn't take into account is that with multiple processes the agent has to be serialised and sent through a pipe, and my agent contains a crap ton of stuff: pandas tables, a cached pygame font, and more.

Sending `str(agent)` instead (which tells me who the current agent is) fixes the problem. Now I have a new bug, but that is an adventure of its own, so I'm closing this issue. Thank you for your interest. I hope I've entertained you 😉
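A minimal sketch of that change, in case it helps someone else; `HeavyAgent` and `MyEnv` are hypothetical placeholders, not the actual project classes:

```python
import numpy as np

class HeavyAgent:
    """Stand-in for the real agent: imagine pandas tables, a cached pygame font, etc."""
    def __repr__(self):
        return "HeavyAgent(id=0)"

class MyEnv:
    """Hypothetical custom gym showing the picklable-info fix."""
    def __init__(self):
        self.agent = HeavyAgent()

    def step(self, action):
        obs = np.zeros((4, 4), dtype=np.float32)
        reward = 0.0
        done = False
        # Bad: info = {"agent": self.agent} -- the whole object, and every
        # resource it holds, would be pickled and shipped through the pipe.
        # Good: send a small, picklable summary instead.
        info = {"agent": str(self.agent)}
        return obs, reward, done, info
```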
Alright, made some progress. `ulimit -n` returns `1024`. Setting `ulimit -n 2048` made the script work for longer, but it died afterwards. So I'm pretty sure I'm hitting some limit because resources are not being deallocated somewhere; I now have to figure out what's doing this nasty thing. Cc: @ikostrikov.
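Two mitigations that commonly come up with this error are sketched below, assuming the limit being hit is the per-process open-file-descriptor limit: raising RLIMIT_NOFILE from inside Python (the programmatic counterpart of `ulimit -n`) and telling PyTorch to share tensor storage through the filesystem instead of file descriptors. Neither cures an actual descriptor leak; they only push the failure further out.

```python
import resource
import torch.multiprocessing as torch_mp

# Inspect and raise this process's open-file-descriptor limit
# (the Python counterpart of `ulimit -n`).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))  # raise soft limit to the hard cap

# Often-suggested workaround when PyTorch tensors cross process boundaries:
# share storage via the filesystem rather than by passing file descriptors around.
torch_mp.set_sharing_strategy("file_system")
```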