RuntimeError: received 0 items of ancdata with custom gym

I’m getting a RuntimeError when I try to run several custom gyms in parallel.

Traceback (most recent call last):
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/run.py", line 270, in <module>
    main()
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/run.py", line 244, in main
    model, env = train(args, extra_args)
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/run.py", line 88, in train
    **alg_kwargs
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/ppo2/ppo2.py", line 329, in learn
    obs, returns, masks, actions, values, neglogpacs, states, epinfos = runner.run() #pylint: disable=E0632
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/ppo2/ppo2.py", line 178, in run
    self.obs[:], rewards, self.dones, infos = self.env.step(actions)
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/common/vec_env/__init__.py", line 100, in step
    return self.step_wait()
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/common/vec_env/vec_normalize.py", line 23, in step_wait
    obs, rews, news, infos = self.venv.step_wait()
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/common/vec_env/subproc_vec_env.py", line 70, in step_wait
    results = [remote.recv() for remote in self.remotes]
  File "/home/atcold/Work/GitHub/OpenAI-RL-baselines/baselines/common/vec_env/subproc_vec_env.py", line 70, in <listcomp>
    results = [remote.recv() for remote in self.remotes]
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/misc/vlgscratch4/LecunGroup/atcold/anaconda3/envs/OpenAI/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

I have no clue how to debug this. Do I need to add any special functionality to my gym to support parallel execution?

Issue Analytics

Created: 5 years ago
Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction

Atcold commented, Oct 10, 2018

Finally debugged this crap. It’s actually quite embarrassing to type out what the cause was… but I’ll do it for the sake of rigour.

My gym step(action) function was returning:

observation: a NumPy multidimensional array;
reward: a scalar;
done: a boolean;
info: the whole agent (for debugging purposes).

What I didn’t take into account is that, with multiple processes, the agent has to be serialised and sent through a pipe on every step. My agent contains a crap ton of stuff: pandas tables, a cached pygame font, and more shit.

Sending str(agent) instead (which just tells me which agent is the current one) fixes the problem.
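As a rough illustration of why this helps, here is a minimal sketch that compares the pickled size of a heavyweight info payload with the str(agent) version. The Agent class, its cache attribute, and the step_info helper are hypothetical stand-ins, not code from the issue; vars(agent) is used only to approximate shipping the agent's whole state.

```python
import pickle


class Agent:
    """Hypothetical stand-in for an agent that carries a lot of state."""

    def __init__(self, name):
        self.name = name
        # Stands in for pandas tables, cached fonts, and similar baggage.
        self.cache = list(range(100_000))

    def __str__(self):
        return f"Agent({self.name})"


def step_info(agent):
    """Build a small, easily picklable info dict for step() to return."""
    return {"agent": str(agent)}


agent = Agent("car-0")

# Roughly what crosses the pipe when the agent's full state goes into info.
heavy_payload = pickle.dumps({"agent": vars(agent)})

# What crosses the pipe when only a short string is sent.
light_payload = pickle.dumps(step_info(agent))

print(len(heavy_payload), len(light_payload))
```

The subprocess workers pickle whatever step() returns on every call, so keeping info down to plain strings and numbers keeps that traffic small and avoids shipping objects that are awkward to serialise.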

Now I have a new bug

self.ret = self.ret * self.gamma + rews
ValueError: operands could not be broadcast together with shapes (12,) (2,)

but this is a new adventure on its own, so I’m closing this issue. Thank you for your interest. I hope I’ve entertained you 😉

0 reactions

Atcold commented, Oct 9, 2018

Alright, I made some progress. ulimit -n returns 1024. Setting ulimit -n 2048 made the script run for longer, but it still died eventually. So I’m pretty sure I’m hitting “some” limit because some resource is not being deallocated. Now I have to figure out who’s doing this nasty thing.
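For anyone chasing a similar leak, the same limit can be inspected from inside Python with the standard resource module, and the number of open descriptors can be watched over time. This is only a sketch for Unix-like systems: count_open_fds and raise_soft_limit are hypothetical helper names, and the /proc/self/fd and /dev/fd paths are assumptions about the platform.

```python
import os
import resource


def count_open_fds():
    """Return the number of open file descriptors, or None if unknown."""
    for path in ("/proc/self/fd", "/dev/fd"):  # Linux, then macOS/BSD
        if os.path.isdir(path):
            return len(os.listdir(path))
    return None


def raise_soft_limit(target):
    """Try to raise the soft open-file limit; return the limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # the soft limit cannot exceed the hard one
    if soft != resource.RLIM_INFINITY and target > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            pass  # the OS refused the change; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)
print("open descriptors:", count_open_fds())
```

Printing count_open_fds() every few training iterations shows whether the count keeps climbing. If it does, raising the limit only postpones the crash, which matches what happened here.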

Cc: @ikostrikov.