PPO model training with Habitat 2020 challenge config
@mathfac @dhruvbatra Hi! Another issue I am struggling with on my side is training the Habitat baseline PPO model with the 2020 challenge configuration for the PointNav task using habitat-api.
As the PPO agent configuration I use the following ppo_pointnav.yaml file:
BASE_TASK_CONFIG_PATH: "configs/tasks/pointnav_gib_rgbd_2020.yaml"
TRAINER_NAME: "ppo"
ENV_NAME: "NavRLEnv"
SIMULATOR_GPU_ID: 1
TORCH_GPU_ID: 1
VIDEO_OPTION: ["disk", "tensorboard"]
TENSORBOARD_DIR: "tb"
VIDEO_DIR: "video_dir"
TEST_EPISODE_COUNT: 994
EVAL_CKPT_PATH_DIR: "data/ppo_2020_checkpoints"
NUM_PROCESSES: 4
SENSORS: ["DEPTH_SENSOR"]
CHECKPOINT_FOLDER: "data/ppo_2020_checkpoints"
NUM_UPDATES: 270000
LOG_INTERVAL: 25
CHECKPOINT_INTERVAL: 2000
RL:
  PPO:
    clip_param: 0.1
    ppo_epoch: 4
    num_mini_batch: 2
    value_loss_coef: 0.5
    entropy_coef: 0.01
    lr: 2.5e-4
    eps: 1e-5
    max_grad_norm: 0.5
    num_steps: 128
    hidden_size: 512
    use_gae: True
    gamma: 0.99
    tau: 0.95
    use_linear_clip_decay: True
    use_linear_lr_decay: True
    reward_window_size: 50
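For reference, habitat_baselines merges this experiment config with the task config from BASE_TASK_CONFIG_PATH before training. Below is a minimal inspection sketch; it assumes the get_config helper from habitat_baselines and the TASK_CONFIG nesting used by habitat-api of this era.

# Minimal inspection sketch (assumes habitat_baselines' get_config helper);
# the task config from BASE_TASK_CONFIG_PATH ends up nested under TASK_CONFIG.
from habitat_baselines.config.default import get_config

config = get_config("habitat_baselines/config/pointnav/ppo_pointnav.yaml")
print(config.TRAINER_NAME)                              # "ppo"
print(config.TASK_CONFIG.SIMULATOR.DEPTH_SENSOR.WIDTH)  # sensor resolution from the task config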
For the task configuration I used the same parameters as in the challenge_pointnav2020.local.rgbd.yaml file:
ENVIRONMENT:
  MAX_EPISODE_STEPS: 500
SIMULATOR:
  AGENT_0:
    SENSORS: ['RGB_SENSOR', 'DEPTH_SENSOR']
    HEIGHT: 0.88
    RADIUS: 0.18
  HABITAT_SIM_V0:
    GPU_DEVICE_ID: 0
    ALLOW_SLIDING: False
  RGB_SENSOR:
    WIDTH: 640
    HEIGHT: 360
    HFOV: 70
    POSITION: [0, 0.88, 0]
    NOISE_MODEL: "GaussianNoiseModel"
    NOISE_MODEL_KWARGS:
      intensity_constant: 0.1
  DEPTH_SENSOR:
    WIDTH: 640
    HEIGHT: 360
    HFOV: 70
    MIN_DEPTH: 0.1
    MAX_DEPTH: 10.0
    POSITION: [0, 0.88, 0]
    NOISE_MODEL: "RedwoodDepthNoiseModel"
  ACTION_SPACE_CONFIG: 'pyrobotnoisy'
  NOISE_MODEL:
    ROBOT: "LoCoBot"
    CONTROLLER: 'Proportional'
    NOISE_MULTIPLIER: 0.5
TASK:
  TYPE: Nav-v0
  SUCCESS_DISTANCE: 0.36
  SENSORS: ['POINTGOAL_SENSOR']
  POINTGOAL_SENSOR:
    GOAL_FORMAT: POLAR
    DIMENSIONALITY: 2
  GOAL_SENSOR_UUID: pointgoal
  MEASUREMENTS: ['DISTANCE_TO_GOAL', "SUCCESS", 'SPL']
  SUCCESS:
    SUCCESS_DISTANCE: 0.36
I only changed the path to the training dataset (Habitat Challenge Data for Gibson, 1.5 GB):
DATASET:
  TYPE: PointNav-v1
  SPLIT: train
  DATA_PATH: data/datasets/pointnav/gibson/v1/{split}/{split}.json.gz
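As a quick sanity check that DATA_PATH and SPLIT resolve to actual episodes, the dataset can be loaded directly. The sketch below is only an illustration; it assumes the habitat.get_config and habitat.make_dataset API of habitat-api from this era, with the task config path taken from the experiment config above.

# Sanity-check sketch (assumes habitat.get_config and habitat.make_dataset):
# confirm the dataset path and split resolve before launching training.
import habitat

config = habitat.get_config("configs/tasks/pointnav_gib_rgbd_2020.yaml")
dataset = habitat.make_dataset(id_dataset=config.DATASET.TYPE, config=config.DATASET)
print(len(dataset.episodes))  # should be > 0 if the Gibson challenge data is in place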
After running the command python -u habitat_baselines/run.py --exp-config habitat_baselines/config/pointnav/ppo_pointnav.yaml --run-type
I got the following error:
---
The active scene does not contain semantic annotations.
---
I0325 20:17:08.559100 8915 simulator.py:143] Loaded navmesh data/scene_datasets/gibson/Monson.navmesh
I0325 20:17:08.559392 8915 simulator.py:155] Recomputing navmesh for agent's height 0.88 and radius 0.18.
I0325 20:17:08.567361 8915 PathFinder.cpp:338] Building navmesh with 275x112 cells
I0325 20:17:08.655342 8915 PathFinder.cpp:606] Created navmesh with 137 vertices 61 polygons
I0325 20:17:08.655371 8915 Simulator.cpp:403] reconstruct navmesh successful
2020-03-25 20:17:08,720 Initializing task Nav-v0
2020-03-25 20:17:11,725 agent number of parameters: 52694149
/home/pryhoda/anaconda3/envs/habitat/lib/python3.6/site-packages/torch-1.4.0-py3.6-linux-x86_64.egg/torch/optim/lr_scheduler.py:122: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:243: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [3,0,0], thread: [0,0,0] Assertion `val >= zero` failed.
[... the same `val >= zero` assertion repeats for every block/thread combination ...]
Traceback (most recent call last):
  File "habitat_baselines/run.py", line 70, in <module>
    main()
  File "habitat_baselines/run.py", line 40, in main
    run_exp(**vars(args))
  File "habitat_baselines/run.py", line 64, in run_exp
    trainer.train()
  File "/home/pryhoda/HabitatProject/habitat-api/habitat_baselines/rl/ppo/ppo_trainer.py", line 346, in train
    rollouts, current_episode_reward, running_episode_stats
  File "/home/pryhoda/HabitatProject/habitat-api/habitat_baselines/rl/ppo/ppo_trainer.py", line 181, in _collect_rollout_step
    outputs = self.envs.step([a[0].item() for a in actions])
  File "/home/pryhoda/HabitatProject/habitat-api/habitat_baselines/rl/ppo/ppo_trainer.py", line 181, in <listcomp>
    outputs = self.envs.step([a[0].item() for a in actions])
RuntimeError: CUDA error: device-side assert triggered
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7f8dfa79ea58>>
Traceback (most recent call last):
  File "/home/pryhoda/HabitatProject/habitat-api/habitat/core/vector_env.py", line 468, in __del__
    self.close()
  File "/home/pryhoda/HabitatProject/habitat-api/habitat/core/vector_env.py", line 350, in close
    write_fn((CLOSE_COMMAND, None))
  File "/home/pryhoda/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/pryhoda/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/home/pryhoda/anaconda3/envs/habitat/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
I am wondering whether the PPO agent is simply not adapted to train with the 2020 challenge config (it ran fine for me with the 2019 challenge config, pointnav_gibson_rgbd.yaml), or whether this is an issue on my side? Thanks in advance!
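For context: this sampleMultinomialOnce assertion (`val >= zero`) is what PyTorch raises when torch.multinomial (or Categorical.sample) receives probabilities containing NaN or negative values, so in a PPO agent it usually points at a NaN forward pass rather than at the sampler itself; rerunning with CUDA_LAUNCH_BLOCKING=1, or on CPU, typically yields a more precise traceback. A minimal, standalone repro of the same assert (illustration only, not the habitat-api code path):

# NaN or negative entries in the probability tensor fail the `val >= zero`
# check in sampleMultinomialOnce and surface as a device-side assert.
import torch

probs = torch.tensor([[float("nan"), 0.5, 0.5, 0.0]], device="cuda")
torch.multinomial(probs, num_samples=1)  # "Assertion `val >= zero` failed"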
Top GitHub Comments
@erikwijmans As you suggested, I trained a DD-PPO model with a resnet18 backbone.
When I tried to evaluate it, I got the following error:
It looks like it loads the config for a model with a resnet50 backbone, but here is my config file:
I am wondering where the problem could be.
You can change resnet50 to resnet18 in the config; that will improve training speed.
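For reference, the field in question is the backbone of the visual encoder; assuming the RL.DDPPO section used by habitat_baselines' ddppo_pointnav.yaml, the change would look like:

RL:
  DDPPO:
    backbone: resnet18   # the default config uses resnet50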