How to attribute reward to multiple model runs in the same trajectory with PPO
I want to finetune a base model M to maximize a reward R when the model is used inside a more complex system.
Take a simple example of the setting. The trajectory is as follows: sample prompt_1 from a dataset of prompts, then

prompt_1 -> M(prompt_1) = out_1
out_1 -> F(out_1) = prompt_2
prompt_2 -> M(prompt_2) = out_2
out_2 -> R(out_2) = reward

where F : str -> str and R : str -> int are some methods defined in my code.
Is there a way to do this in the current TRLX framework, preferably online with PPO?
Alternative suggestions are welcome.
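For concreteness, here is a rough, self-contained sketch of one rollout. M, F, and R are stubbed out with placeholders (in the real setup M is the model's generate call, and F and R are arbitrary methods in my code); the helper names below are only for illustration.

```python
# Self-contained sketch of the trajectory described above.
# M, F, and R are placeholders: in the real setup M is the language model
# being finetuned, and F / R are arbitrary methods defined in my code.

from typing import List, Tuple


def M(prompt: str) -> str:
    """Placeholder for the model's generate call."""
    return f"<generation for: {prompt}>"


def F(out_1: str) -> str:
    """Placeholder for the intermediate transformation (str -> str)."""
    return f"follow-up prompt built from: {out_1}"


def R(out_2: str) -> int:
    """Placeholder for the final reward (str -> int)."""
    return len(out_2) % 5


def rollout(prompt_1: str) -> Tuple[List[Tuple[str, str]], int]:
    """Run one trajectory and return both (prompt, output) steps plus the reward."""
    out_1 = M(prompt_1)
    prompt_2 = F(out_1)
    out_2 = M(prompt_2)
    reward = R(out_2)
    # Both generations come from the same model M and the same trajectory;
    # the question is how to credit `reward` to both of them during PPO training.
    return [(prompt_1, out_1), (prompt_2, out_2)], reward


if __name__ == "__main__":
    steps, reward = rollout("example prompt")
    print(steps, reward)
```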
Top GitHub Comments
I miswrote; what is true is that TRLX assumes the “actions” are just a single model.generate call.

Yes, just @paul on that server.

I’d be excited to help implement it, but I’m skeptical about whether I understand PPO well enough, and whether I’m familiar enough with the trlx codebase, to do it. I might be able to make a contribution if @dpaleka and I work on this together? Also, thanks for the super quick reply!
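Since TRLX currently assumes a single model.generate call per sample (per the comment above), one illustrative workaround is to collect trajectories like the one in the question yourself and then broadcast the terminal reward to each (prompt, output) step before handing the samples to a PPO trainer that accepts pre-collected experience. The sketch below is not TRLX API; attribute_reward, gamma, and the discounting choice are assumptions used only to show the credit-assignment idea.

```python
# Illustrative sketch only: this is NOT current trlx API. It shows one way to
# attribute a single trajectory-level reward to both generation steps, which
# is the credit-assignment question raised in the issue.

from typing import List, Tuple

Step = Tuple[str, str]  # (prompt, output)


def attribute_reward(
    steps: List[Step], terminal_reward: float, gamma: float = 1.0
) -> List[Tuple[str, str, float]]:
    """Broadcast the trajectory-level reward to every (prompt, output) step.

    With gamma = 1.0 both model calls receive the full reward; gamma < 1.0
    discounts earlier steps more, a common choice when only the final step
    of a trajectory produces a non-zero reward.
    """
    num_steps = len(steps)
    samples = []
    for i, (prompt, output) in enumerate(steps):
        # Steps further from the end of the trajectory are discounted more.
        step_reward = terminal_reward * (gamma ** (num_steps - 1 - i))
        samples.append((prompt, output, step_reward))
    return samples


if __name__ == "__main__":
    trajectory = [("prompt_1", "out_1"), ("prompt_2", "out_2")]
    print(attribute_reward(trajectory, terminal_reward=1.0, gamma=0.99))
```

Each resulting (prompt, output, reward) triple could then be treated as an independent PPO sample; whether that is the right way to attribute credit for this kind of system is exactly what the issue is asking about.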