
How to attribute reward to multiple model runs in the same trajectory with PPO

See original GitHub issue

I want to fine-tune a base model M to maximize a reward R when the model is used inside a more complex system. Take a simple example of this setting. The trajectory is as follows: sample prompt_1 from a dataset of prompts, then

prompt_1 -> M(prompt_1) = out_1
out_1 -> F(out_1) = prompt_2
prompt_2 -> M(prompt_2) = out_2
out_2 -> R(out_2) = reward

where F : str -> str and R : str -> int are some methods defined in my code. Is there a way to do this in the current TRLX framework, preferably online with PPO? Alternative suggestions are welcome.
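
For concreteness, the trajectory can be written out as plain Python. The following is only a minimal sketch: M, F, and R are dummy stand-ins for the model's generate step and the user-defined transform and reward; the point is just that the reward is observed only after two separate calls to M.

def M(prompt: str) -> str:
    # Stand-in for a model.generate call on the policy being fine-tuned.
    return "model output for: " + prompt

def F(out_1: str) -> str:
    # User-defined transform that builds the second prompt (F : str -> str).
    return "follow-up prompt built from: " + out_1

def R(out_2: str) -> int:
    # User-defined reward on the final output (R : str -> int).
    return len(out_2) % 2  # dummy reward, for illustration only

def rollout(prompt_1: str) -> int:
    out_1 = M(prompt_1)      # first model call
    prompt_2 = F(out_1)      # intermediate, non-learned step
    out_2 = M(prompt_2)      # second model call
    return R(out_2)          # reward is only observed here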

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
dpaleka commented, Nov 1, 2022

As of now, TRLX supports only RL setups where all the “actions” the reward is attributed to are produced before the reward function is called.

@dpaleka, isn’t this already the case in your very first pseudocode snippet? R is only called after both M calls, not in between, right?

I miswrote; what is true is that TRLX assumes the “actions” are the tokens produced by a single model.generate call.
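
To illustrate the constraint, here is a hedged sketch of the closest setup that fits the current assumption: folding F and the second model call into the reward function, so that TRLX still performs only one generate call per prompt. The trlx.train / reward_fn signature differs between versions, and generate_with_frozen_model is a hypothetical helper introduced only for this sketch, so treat this as an assumed-API illustration rather than a confirmed recipe.

import trlx

def F(out_1: str) -> str:
    # User-defined transform from the question (placeholder body).
    return out_1

def R(out_2: str) -> float:
    # User-defined reward from the question (placeholder body).
    return float(len(out_2))

def generate_with_frozen_model(prompt_2: str) -> str:
    # Hypothetical helper: run the second generation outside TRLX's rollout,
    # e.g. with a frozen copy of the model. Its tokens never enter the PPO
    # update, which is exactly the limitation described above.
    return prompt_2

def reward_fn(samples, **kwargs):
    # `samples` are the outputs of the single model.generate call that the
    # PPO rollout performs, i.e. out_1 in the trajectory above.
    rewards = []
    for out_1 in samples:
        prompt_2 = F(out_1)
        out_2 = generate_with_frozen_model(prompt_2)
        rewards.append(R(out_2))
    return rewards

trainer = trlx.train(reward_fn=reward_fn, prompts=["an example prompt"])

The caveat is that PPO credit lands only on the tokens of out_1, since that is the only generation TRLX sees; attributing reward to out_2 as well would require changes to the rollout / experience-collection code itself, which is what this issue is asking for.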

0 reactions
paulbricman commented, Nov 2, 2022

@paulbricman are you on the discord? https://discord.gg/canadagoose

Yes, just @paul on that server.

Hey! I am open to rectifying this, but I am at capacity right now and I don’t think we have the engineering manpower for it at the moment. @paulbricman @dpaleka, if you two would be interested in implementing it, I’d be happy to assign you and then review it.

I’d be excited to help implement it, but I’m skeptical about whether I understand PPO well enough, and whether I’m familiar enough with the trlx codebase, to do it. I might be able to make a contribution if @dpaleka and I both worked on this?

Also thanks for the super quick reply!

Read more comments on GitHub >

