
Discount not applied in evaluate_policy?

See original GitHub issue

Maybe I am missing something here, but I feel the line that calculates the return, https://github.com/keiohta/tf2rl/blob/82d9eecda78e22021efa0821bf02429ac7827f4d/tf2rl/experiments/trainer.py#L207

should be updated to include the discount factor:

            for j in range(total_steps):
                action = self._policy.get_action(obs, test=True)
                next_obs, reward, done, _ = self._test_env.step(action)
                avg_test_steps += 1
                if self._save_test_path:
                    replay_buffer.add(obs=obs, act=action,
                                      next_obs=next_obs, rew=reward, done=done)

                if self._save_test_movie:
                    element = self._test_env.render(mode='rgb_array')
                    frames.append(element)
                elif self._show_test_progress:
                    self._test_env.render()
                # proposed change: weight the reward by gamma^j instead of summing it raw
                episode_return += reward * np.power(self._policy.discount, j)

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
keiohta commented, Jul 29, 2021

I agree with @ymd-h that the evaluation score does not include the discount factor. I think the reason the DDQN paper reports the discounted return is to evaluate the overestimation phenomenon: since the Q-network produces an estimate of the discounted cumulative reward, the “true” return it is compared against should also be computed with the discount factor. I don’t think other papers report discounted returns.
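
To make that overestimation check concrete, here is a minimal standalone sketch: roll out one evaluation episode, accumulate the discounted return, and compare it with the critic’s estimate at the first state-action pair. The `compute_q_value` helper and the plain Gym-style `env` interface are assumptions for illustration, not part of tf2rl’s Trainer API.

    # Hedged sketch: compare the critic's estimate Q(s0, a0) against the
    # empirical *discounted* return of the same evaluation rollout.
    # `policy.compute_q_value` is a hypothetical helper, not tf2rl's API.
    import numpy as np

    def overestimation_gap(policy, env, max_steps=1000):
        obs = env.reset()
        action = policy.get_action(obs, test=True)
        q_estimate = policy.compute_q_value(obs, action)  # hypothetical helper

        discounted_return = 0.0
        for t in range(max_steps):
            obs, reward, done, _ = env.step(action)
            discounted_return += reward * np.power(policy.discount, t)
            if done:
                break
            action = policy.get_action(obs, test=True)

        # A positive gap means the critic overestimates the discounted return.
        return q_estimate - discounted_return

Averaging this gap over evaluation episodes is one way to quantify the overestimation being discussed.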

0 reactions
ymd-h commented, Aug 9, 2021

@naji-s

Although the paper (maybe) doesn’t describe the definition, I think the plots show non-discounted rewards obtained from models trained with discounted rewards.

As long as the discount factor (gamma) is fixed (and n-step is fixed), you can use the discounted reward for model comparison, but it is not a universal metric. To improve model performance, we try to tune the discount factor, so the metric itself should be independent of the discount factor.
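
A toy calculation illustrates why the discounted return is not a universal metric (the 100-step, constant-reward episode below is just an assumption for illustration): the same behaviour scores very differently under different gammas, while the plain sum stays put.

    # Toy illustration: for the same reward sequence, the discounted return
    # changes with gamma, while the undiscounted sum does not.
    import numpy as np

    rewards = np.ones(100)  # hypothetical episode: reward 1 at every step

    for gamma in (0.9, 0.99, 0.999):
        discounted = np.sum(rewards * gamma ** np.arange(len(rewards)))
        print(f"gamma={gamma}: discounted={discounted:.2f}, undiscounted={rewards.sum():.0f}")

    # gamma=0.9  : discounted ~ 10.00, undiscounted = 100
    # gamma=0.99 : discounted ~ 63.40, undiscounted = 100
    # gamma=0.999: discounted ~ 95.21, undiscounted = 100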
