Errors in training using provided config
See original GitHub issue

❓ Questions and Help
Hi, I am trying to reproduce the results on the PointNav task (on the Matterport3D dataset) using the config habitat_baselines/config/pointnav/ppo_pointnav.yaml. Training runs into errors after a few hundred to a few thousand updates. I printed the actions during rollout collection, and the policy seems to collapse to a single action. Can you please help me with this?
......
......
......
<pre>2020-02-22 12:18:13,421 update: 100 env-time: 141.647s pth-time: 68.360s frames: 38784
2020-02-22 12:18:13,421 Average window size 50 reward: nan
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [1,0,0], thread: [0,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [1,0,0], thread: [1,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [1,0,0], thread: [2,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [1,0,0], thread: [3,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [2,0,0], thread: [0,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [2,0,0], thread: [1,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [2,0,0], thread: [2,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [2,0,0], thread: [3,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [0,0,0], thread: [0,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [0,0,0], thread: [1,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [0,0,0], thread: [2,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [0,0,0], thread: [3,0,0] Assertion `val >= zero` failed.
Traceback (most recent call last):
  File "habitat_baselines/run.py", line 218, in <module>
    main()
  File "habitat_baselines/run.py", line 168, in main
    run_exp(**vars(args))
  File "habitat_baselines/run.py", line 212, in run_exp
    trainer.train()
  File "/local-scratch/_saim_/habitatAPI/habitat_baselines/rl/ppo/ppo_trainer.py", line 298, in train
    episode_counts,
  File "/local-scratch/_saim_/habitatAPI/habitat_baselines/rl/ppo/ppo_trainer.py", line 144, in _collect_rollout_step
    outputs = self.envs.step([a[0].item() for a in actions])
  File "/local-scratch/_saim_/habitatAPI/habitat_baselines/rl/ppo/ppo_trainer.py", line 144, in <listcomp>
    outputs = self.envs.step([a[0].item() for a in actions])
RuntimeError: CUDA error: device-side assert triggered
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7f8a3dd36da0>>
Traceback (most recent call last):
  File "/local-scratch/_saim_/habitatAPI/habitat/core/vector_env.py", line 468, in __del__
  File "/local-scratch/_saim_/habitatAPI/habitat/core/vector_env.py", line 350, in close
  File "/local-scratch/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 206, in send
AttributeError: 'NoneType' object has no attribute 'dumps'</pre>
Thanks!
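For context, the repeated `Assertion val >= zero failed` messages come from torch.multinomial, which Categorical.sample() uses under the hood; it trips when the action probabilities contain NaN or negative values, which is consistent with the `reward: nan` line above. A minimal, hypothetical guard like the one below, placed just before the sampling call in the rollout loop, turns the opaque device-side assert into a readable Python error (the variable names are illustrative, not the actual trainer's):
<pre>import torch
from torch.distributions import Categorical

def check_action_logits(logits: torch.Tensor) -> None:
    """Fail fast with a readable error if the policy outputs are unusable.

    Categorical(logits=...).sample() calls torch.multinomial internally,
    which triggers the device-side `val >= zero` assert when the resulting
    probabilities contain NaN or negative values.
    """
    if torch.isnan(logits).any():
        raise ValueError("NaN in action logits -- the policy has diverged")
    if torch.isinf(logits).any():
        raise ValueError("Inf in action logits -- check reward/advantage scaling")

# Illustrative use inside a rollout step (names are hypothetical):
# logits = actor_critic.action_logits(observations)
# check_action_logits(logits)
# actions = Categorical(logits=logits).sample()
</pre>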
Top GitHub Comments
Took a closer look. There are a select few episodes that are now invalid after fixing a bug that seemingly should have had no effect on episode navigability. Fortunately, this only happened in the train split, so it is straightforward to just remove them: https://gist.github.com/erikwijmans/e4410f0e12facb87890e919aa264e3fe
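For reference, removing episodes from a PointNav split amounts to loading the gzipped JSON dataset, dropping the blacklisted entries, and writing it back out. The sketch below is only illustrative: the actual episode list lives in the gist above (not reproduced here), the paths and blacklist format are assumptions, and for Matterport3D the train episodes are typically sharded per scene under a content/ directory, so the filtering would be applied to each shard.
<pre>import gzip
import json

# Hypothetical blacklist: episode_ids of the invalid train episodes
# (the actual list comes from the linked gist).
INVALID_EPISODE_IDS = {"123", "456"}

def filter_split(in_path: str, out_path: str) -> None:
    """Drop blacklisted episodes from a gzipped PointNav dataset file."""
    with gzip.open(in_path, "rt") as f:
        data = json.load(f)

    data["episodes"] = [
        ep for ep in data["episodes"]
        if str(ep["episode_id"]) not in INVALID_EPISODE_IDS
    ]

    with gzip.open(out_path, "wt") as f:
        json.dump(data, f)

# Example (paths are assumptions):
# filter_split("data/datasets/pointnav/mp3d/v1/train/content/SCENE.json.gz",
#              "filtered/SCENE.json.gz")
</pre>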
Quick thing to try: move these lines https://github.com/facebookresearch/habitat-api/blob/master/habitat_baselines/rl/ppo/ppo_trainer.py#L332-L338 to after the agent update here: https://github.com/facebookresearch/habitat-api/blob/master/habitat_baselines/rl/ppo/ppo_trainer.py#L357
PyTorch recently changed the way LR schedulers work (the optimizer now needs to be stepped first); does that fix it?
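For reference, since PyTorch 1.1.0 optimizer.step() must be called before lr_scheduler.step(); stepping the scheduler first skips the initial value of the schedule and emits a warning. A minimal sketch of the intended ordering is below; the model, loss, and update count are placeholders, not the actual habitat_baselines trainer code.
<pre>import torch
from torch.optim.lr_scheduler import LambdaLR

num_updates = 1000                                   # placeholder for NUM_UPDATES
model = torch.nn.Linear(4, 2)                        # stand-in for the actor-critic
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
scheduler = LambdaLR(optimizer, lr_lambda=lambda u: 1 - u / num_updates)  # linear decay

for update in range(num_updates):
    # ... collect rollouts and compute the PPO losses ...
    loss = model(torch.randn(8, 4)).pow(2).mean()    # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()    # agent update first
    scheduler.step()    # then decay the learning rate (PyTorch >= 1.1.0 ordering)
</pre>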