Huge actor loss in `TD3`
Question
I am using a TD3 agent to solve the MountainCarContinuous-v0 environment. Is it normal for the actor's loss to become huge at some point? Isn't this dangerous for the model's weights? The strange part is that the agent is actually solving the environment: the accumulated reward is gradually increasing.
Additional context
This is my code (as suggested by @qgallouedec in https://github.com/DLR-RM/stable-baselines3/issues/936):
import gym
import numpy as np

from stable_baselines3 import TD3
from stable_baselines3.common.noise import OrnsteinUhlenbeckActionNoise

env = gym.make("MountainCarContinuous-v0")

# Temporally correlated (Ornstein-Uhlenbeck) exploration noise with a fairly
# large sigma, which helps the car build momentum in this sparse-reward task.
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(1), sigma=0.5 * np.ones(1))

model = TD3("MlpPolicy", env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=1_000_000)
This is the logger’s output after some iterations.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 361      |
|    ep_rew_mean     | 60.8     |
| time/              |          |
|    episodes        | 68       |
|    fps             | 111      |
|    time_elapsed    | 219      |
|    total_timesteps | 24547    |
| train/             |          |
|    actor_loss      | -14.4    |
|    critic_loss     | 1.08     |
|    learning_rate   | 0.001    |
|    n_updates       | 24471    |
---------------------------------
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Yes, that's how the actor loss is defined: https://github.com/DLR-RM/stable-baselines3/blob/a6f5049a99a4c21a6f0bcce458ca3306cef310e0/stable_baselines3/td3/td3.py#L182
Its magnitude depends on the magnitude of the rewards and on the discount factor.
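For reference, here is a minimal sketch of that computation. The names actor, critic and q1_forward mirror SB3's TD3 implementation, but the helper function itself is only an illustration, not the library code: the actor is trained to maximize the first critic's Q-value estimate, so the reported loss is the negative mean Q-value of the actor's own actions.

def td3_actor_loss(actor, critic, observations):
    # Deterministic policy gradient objective: score the actor's actions
    # with the first critic, then maximize that score by minimizing its negation.
    actions = actor(observations)
    q_values = critic.q1_forward(observations, actions)
    return -q_values.mean()

Since the loss is simply -mean(Q), its scale tracks the scale of the learned Q-values: with a discount factor close to 1 and non-trivial rewards, the Q-values (and therefore the loss magnitude) can legitimately grow large as the policy improves, which is consistent with the increasing episode reward in the log above.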
Yeah, correct. Well, I think that covers my question. Thank you all 😃