Loss function (in Policy Gradient section), optimizer and entropy
See original GitHub issue

Dear Mr. hongzi,

I was interested in your resource scheduling method. Now I am stuck in your network class: I can't understand why you used the function below:

loss = T.log(prob_act[T.arange(N), actions]).dot(values) / N

Did you design a special loss function here? If not, what is the name of this loss function?
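For context, the quoted line computes the classic REINFORCE (score-function) policy gradient objective: the batch mean of log pi(a_t | s_t) weighted by the return-like entries of values. A minimal NumPy sketch of the same computation, assuming prob_act has shape (N, num_actions) and actions and values are length-N vectors:

import numpy as np

def reinforce_objective(prob_act, actions, values):
    # log-probability of each chosen action: log pi(a_t | s_t)
    N = len(actions)
    log_probs = np.log(prob_act[np.arange(N), actions])
    # return-weighted log-likelihood, averaged over the batch
    return log_probs.dot(values) / N

Maximizing this quantity (or minimizing its negation) is exactly the REINFORCE update.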
Issue Analytics
- Created: 3 years ago
- Comments: 12 (4 by maintainers)
Top GitHub Comments
You are right that rmsprop_updates is a customized function. I guess back then standardized libraries for those optimizers were not available 😃 Things are easier nowadays. And you are right about the gradient operations in TensorFlow or PyTorch. Here's how we computed the advantage Gt, with a time-based baseline: https://github.com/hongzimao/deeprm/blob/master/pg_re.py#L193-L202.
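A hedged sketch of that idea (illustrative only, not the repo's exact code): compute each trajectory's discounted return G_t, then subtract at every timestep the average return across the trajectories still running at that timestep:

import numpy as np

def discounted_returns(rewards, gamma):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over one trajectory
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def time_based_advantages(all_rewards, gamma):
    # all_rewards: list of per-trajectory reward arrays (lengths may differ)
    returns = [discounted_returns(r, gamma) for r in all_rewards]
    max_len = max(len(g) for g in returns)
    padded = np.full((len(returns), max_len), np.nan)
    for i, g in enumerate(returns):
        padded[i, :len(g)] = g
    # baseline at time t: mean of G_t over trajectories alive at t
    baseline = np.nanmean(padded, axis=0)
    return [g - baseline[:len(g)] for g in returns]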
IIRC, RMSProp was slightly more stable than Adam in our experiments. FWIW, the original A3C paper also used RMSProp (https://arxiv.org/pdf/1602.01783.pdf, see the optimization discussion in Section 4).
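For reference, a hand-rolled RMSProp step of the kind rmsprop_updates presumably implements (a minimal sketch of the standard update rule; the function name and hyperparameter defaults here are illustrative, not the repo's):

import numpy as np

def rmsprop_step(param, grad, cache, lr=1e-3, rho=0.9, eps=1e-6):
    # cache: running average of squared gradients, same shape as param
    cache = rho * cache + (1.0 - rho) * grad ** 2
    # descent step; negate the policy gradient objective to maximize it
    param = param - lr * grad / np.sqrt(cache + eps)
    return param, cache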
The last comment was about the different episode termination criteria. I think they have their literal meaning: 'no_new_jobs' ends the episode once no new jobs are arriving, while 'all_done' terminates the episode only when all jobs (including those still unfinished when 'no_new_jobs' is satisfied) have completed: https://github.com/hongzimao/deeprm/blob/b42eff0ab843c83c2b1b8d44e65f99440fa2a543/environment.py#L255-L265.
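A hedged restatement of the two criteria as code (hypothetical helper and argument names; see the linked environment.py for the actual logic):

def episode_done(end_criterion, no_more_arrivals, unfinished_jobs):
    if end_criterion == 'no_new_jobs':
        # stop as soon as the arrival sequence is exhausted,
        # even if admitted jobs are still running
        return no_more_arrivals
    if end_criterion == 'all_done':
        # stop only once arrivals are exhausted AND every job,
        # including those still running at that point, has completed
        return no_more_arrivals and unfinished_jobs == 0
    raise ValueError(end_criterion)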