How to attribute reward to multiple model runs in the same trajectory with PPO
I want to finetune a base model M to maximize a reward R when the model is used inside a more complex system.
Take a simple example of the setting. The trajectory is as follows: sample prompt_1 from a dataset of prompts, then

prompt_1 -> M(prompt_1) = out_1
out_1 -> F(out_1) = prompt_2
prompt_2 -> M(prompt_2) = out_2
out_2 -> R(out_2) = reward

where F : str -> str and R : str -> int are some methods defined in my code.
Is there a way to do this in the current TRLX framework, preferably online with PPO?
Alternative suggestions are welcome.
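For concreteness, here is a rough, self-contained sketch of one rollout. M, F, and R are stubbed out with placeholders (in the real setup M is the model's generate call, and F and R are arbitrary methods in my code); the helper names below are only for illustration.

```python
# Self-contained sketch of the trajectory described above.
# M, F, and R are placeholders: in the real setup M is the language model
# being finetuned, and F / R are arbitrary methods defined in my code.

from typing import List, Tuple


def M(prompt: str) -> str:
    """Placeholder for the model's generate call."""
    return f"<generation for: {prompt}>"


def F(out_1: str) -> str:
    """Placeholder for the intermediate transformation (str -> str)."""
    return f"follow-up prompt built from: {out_1}"


def R(out_2: str) -> int:
    """Placeholder for the final reward (str -> int)."""
    return len(out_2) % 5


def rollout(prompt_1: str) -> Tuple[List[Tuple[str, str]], int]:
    """Run one trajectory and return both (prompt, output) steps plus the reward."""
    out_1 = M(prompt_1)
    prompt_2 = F(out_1)
    out_2 = M(prompt_2)
    reward = R(out_2)
    # Both generations come from the same model M and the same trajectory;
    # the question is how to credit `reward` to both of them during PPO training.
    return [(prompt_1, out_1), (prompt_2, out_2)], reward


if __name__ == "__main__":
    steps, reward = rollout("example prompt")
    print(steps, reward)
```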
Top GitHub Comments
I miswrote; what is true is that TRLX assumes the “actions” are just a single model.generate call.

Yes, just @paul on that server.

I’d be excited to help implement it, but I’m skeptical about whether I understand PPO well enough, and whether I’m familiar enough with the trlx codebase, to do it. I might be able to make a contribution if @dpaleka and I work on this together? Also, thanks for the super quick reply!
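Since TRLX currently assumes a single model.generate call per sample (per the comment above), one illustrative workaround is to collect trajectories like the one in the question yourself and then broadcast the terminal reward to each (prompt, output) step before handing the samples to a PPO trainer that accepts pre-collected experience. The sketch below is not TRLX API; attribute_reward, gamma, and the discounting choice are assumptions used only to show the credit-assignment idea.

```python
# Illustrative sketch only: this is NOT current trlx API. It shows one way to
# attribute a single trajectory-level reward to both generation steps, which
# is the credit-assignment question raised in the issue.

from typing import List, Tuple

Step = Tuple[str, str]  # (prompt, output)


def attribute_reward(
    steps: List[Step], terminal_reward: float, gamma: float = 1.0
) -> List[Tuple[str, str, float]]:
    """Broadcast the trajectory-level reward to every (prompt, output) step.

    With gamma = 1.0 both model calls receive the full reward; gamma < 1.0
    discounts earlier steps more, a common choice when only the final step
    of a trajectory produces a non-zero reward.
    """
    num_steps = len(steps)
    samples = []
    for i, (prompt, output) in enumerate(steps):
        # Steps further from the end of the trajectory are discounted more.
        step_reward = terminal_reward * (gamma ** (num_steps - 1 - i))
        samples.append((prompt, output, step_reward))
    return samples


if __name__ == "__main__":
    trajectory = [("prompt_1", "out_1"), ("prompt_2", "out_2")]
    print(attribute_reward(trajectory, terminal_reward=1.0, gamma=0.99))
```

Each resulting (prompt, output, reward) triple could then be treated as an independent PPO sample; whether that is the right way to attribute credit for this kind of system is exactly what the issue is asking about.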