Errors in training using provided config

See original GitHub issue

❓ Questions and Help

Hi, I am trying to reproduce the results on the PointNav task (on the Matterport3D dataset) using the config habitat_baselines/config/pointnav/ppo_pointnav.yaml. Training runs into errors after a few hundred to a few thousand updates. I tried printing the actions during rollout collection, and the policy seems to converge to a single action. Can you please help me with this?

......
<pre>2020-02-22 12:18:13,421 update: 100	env-time: 141.647s	pth-time: 68.360s	frames: 38784
2020-02-22 12:18:13,421 Average window size 50 reward: nan
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [1,0,0], thread: [0,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [1,0,0], thread: [1,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [1,0,0], thread: [2,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [1,0,0], thread: [3,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [2,0,0], thread: [0,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [2,0,0], thread: [1,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [2,0,0], thread: [2,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [2,0,0], thread: [3,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [0,0,0], thread: [0,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [0,0,0], thread: [1,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [0,0,0], thread: [2,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [0,0,0], thread: [3,0,0] Assertion `val >= zero` failed.
Traceback (most recent call last):
  File "habitat_baselines/run.py", line 218, in <module>
    main()
  File "habitat_baselines/run.py", line 168, in main
    run_exp(**vars(args))
  File "habitat_baselines/run.py", line 212, in run_exp
    trainer.train()
  File "/local-scratch/_saim_/habitatAPI/habitat_baselines/rl/ppo/ppo_trainer.py", line 298, in train
    episode_counts,
  File "/local-scratch/_saim_/habitatAPI/habitat_baselines/rl/ppo/ppo_trainer.py", line 144, in _collect_rollout_step
    outputs = self.envs.step([a[0].item() for a in actions])
  File "/local-scratch/_saim_/habitatAPI/habitat_baselines/rl/ppo/ppo_trainer.py", line 144, in <listcomp>
    outputs = self.envs.step([a[0].item() for a in actions])
RuntimeError: CUDA error: device-side assert triggered
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7f8a3dd36da0>>
Traceback (most recent call last):
  File "/local-scratch/_saim_/habitatAPI/habitat/core/vector_env.py", line 468, in __del__
  File "/local-scratch/_saim_/habitatAPI/habitat/core/vector_env.py", line 350, in close
  File "/local-scratch/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 206, in send
AttributeError: 'NoneType' object has no attribute 'dumps'</pre>

Thanks!
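
For context on the log above: the MultinomialKernel.cu assertion `val >= zero` fires when torch.multinomial (which categorical action distributions call under the hood when sampling) receives probabilities that are NaN or negative, and the "Average window size 50 reward: nan" line suggests the policy output has indeed gone non-finite. Below is a minimal sketch of a host-side guard one could add around action sampling to surface the problem with a readable error; the sample_actions name and the standalone logits tensor are illustrative, not the actual variables in ppo_trainer.py.

<pre>import torch
from torch.distributions import Categorical

def sample_actions(logits: torch.Tensor) -> torch.Tensor:
    # Fail with a readable host-side error instead of the opaque
    # device-side assert that torch.multinomial raises on bad input.
    if not torch.isfinite(logits).all():
        raise RuntimeError(f"non-finite policy logits: {logits}")
    return Categorical(logits=logits).sample()

# Reproducing the failure mode on CPU for illustration:
# bad = torch.tensor([[float("nan"), 0.0, 0.0, 0.0]])
# sample_actions(bad)  # raises RuntimeError instead of a CUDA device-side assert</pre>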

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5

Top GitHub Comments

2 reactions
erikwijmans commented, Mar 16, 2020

Took a closer look. There are a select few episodes that are now invalid after fixing a bug that seemingly should have had no effect on episode navigability. Fortunately, this only happened in the train split, so it is straightforward to just remove them: https://gist.github.com/erikwijmans/e4410f0e12facb87890e919aa264e3fe
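
The gist itself is not reproduced here, but the general shape of such a fix is to drop the affected episodes from the train split. A rough sketch, assuming the standard Habitat PointNav dataset layout (a .json.gz file with a top-level "episodes" list keyed by "episode_id") and a bad_episode_ids set that would come from the gist rather than from anything shown in this thread:

<pre>import gzip
import json

def remove_episodes(split_path: str, bad_episode_ids: set, out_path: str) -> None:
    # split_path points at a PointNav split file, e.g.
    # data/datasets/pointnav/mp3d/v1/train/train.json.gz
    with gzip.open(split_path, "rt") as f:
        dataset = json.load(f)

    before = len(dataset["episodes"])
    dataset["episodes"] = [
        ep for ep in dataset["episodes"] if ep["episode_id"] not in bad_episode_ids
    ]
    print(f"kept {len(dataset['episodes'])} of {before} episodes")

    with gzip.open(out_path, "wt") as f:
        json.dump(dataset, f)</pre>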

0 reactions
erikwijmans commented, Mar 16, 2020

Quick thing to try: move these lines https://github.com/facebookresearch/habitat-api/blob/master/habitat_baselines/rl/ppo/ppo_trainer.py#L332-L338 to after the agent update here: https://github.com/facebookresearch/habitat-api/blob/master/habitat_baselines/rl/ppo/ppo_trainer.py#L357

PyTorch fairly recently changed the way LR schedulers work (the optimizer now needs to be stepped first); does that fix it?
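
For reference, since PyTorch 1.1 the optimizer is expected to be stepped before the LR scheduler; calling lr_scheduler.step() first emits a warning and skips the intended initial learning rate. A self-contained toy version of the suggested ordering follows; the LambdaLR linear decay mirrors the trainer's setup, but the model and loop below are placeholders rather than the actual habitat_baselines code.

<pre>import torch

model = torch.nn.Linear(4, 2)  # stand-in for the policy network
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: 1.0 - step / 1000  # linear decay
)

for update in range(3):
    loss = model(torch.randn(8, 4)).pow(2).mean()  # stand-in for the PPO loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()     # step the optimizer first (in the trainer this happens inside agent.update())...
    lr_scheduler.step()  # ...and only then the scheduler, per PyTorch >= 1.1</pre>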
