Huge actor loss in `TD3`
Question
I am using a TD3 agent to solve the MountainCarContinuous-v0 environment. Is it normal for the actor's loss to become huge at some point? Isn't this dangerous for the model's weights? The strange part is that the agent is actually solving the environment: the accumulated reward is gradually increasing.
Additional context
This is my code (as suggested by @qgallouedec in https://github.com/DLR-RM/stable-baselines3/issues/936):
import gym
import numpy as np

from stable_baselines3 import TD3
from stable_baselines3.common.noise import OrnsteinUhlenbeckActionNoise

env = gym.make("MountainCarContinuous-v0")

# Temporally correlated (Ornstein-Uhlenbeck) exploration noise with a fairly
# large sigma, which helps the car build momentum in this sparse-reward task.
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(1), sigma=0.5 * np.ones(1))

model = TD3("MlpPolicy", env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=1_000_000)
This is the logger’s output after some iterations.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 361      |
|    ep_rew_mean     | 60.8     |
| time/              |          |
|    episodes        | 68       |
|    fps             | 111      |
|    time_elapsed    | 219      |
|    total_timesteps | 24547    |
| train/             |          |
|    actor_loss      | -14.4    |
|    critic_loss     | 1.08     |
|    learning_rate   | 0.001    |
|    n_updates       | 24471    |
---------------------------------
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Yes, that's how the actor loss is defined: https://github.com/DLR-RM/stable-baselines3/blob/a6f5049a99a4c21a6f0bcce458ca3306cef310e0/stable_baselines3/td3/td3.py#L182
Its magnitude depends on the magnitude of the rewards and on the discount factor.
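For reference, here is a minimal sketch of that computation. The names actor, critic and q1_forward mirror SB3's TD3 implementation, but the helper function itself is only an illustration, not the library code: the actor is trained to maximize the first critic's Q-value estimate, so the reported loss is the negative mean Q-value of the actor's own actions.

def td3_actor_loss(actor, critic, observations):
    # Deterministic policy gradient objective: score the actor's actions
    # with the first critic, then maximize that score by minimizing its negation.
    actions = actor(observations)
    q_values = critic.q1_forward(observations, actions)
    return -q_values.mean()

Since the loss is simply -mean(Q), its scale tracks the scale of the learned Q-values: with a discount factor close to 1 and non-trivial rewards, the Q-values (and therefore the loss magnitude) can legitimately grow large as the policy improves, which is consistent with the increasing episode reward in the log above.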
Yeah, correct. Well, I think that covers my question. Thank you all 😃