
Discount not applied in evaluate_policy?

See original GitHub issue

Maybe I am missing something here, but I feel the line that calculates the return, https://github.com/keiohta/tf2rl/blob/82d9eecda78e22021efa0821bf02429ac7827f4d/tf2rl/experiments/trainer.py#L207

should be updated to include the discount factor:

            for j in range(total_steps):
                action = self._policy.get_action(obs, test=True)
                next_obs, reward, done, _ = self._test_env.step(action)
                avg_test_steps += 1
                if self._save_test_path:
                    replay_buffer.add(obs=obs, act=action,
                                      next_obs=next_obs, rew=reward, done=done)

                if self._save_test_movie:
                    element = self._test_env.render(mode='rgb_array')
                    frames.append(element)
                elif self._show_test_progress:
                    self._test_env.render()
                # proposed change: weight the reward by gamma^j instead of summing it raw
                episode_return += reward * np.power(self._policy.discount, j)

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
keiohta commented, Jul 29, 2021

I agree with @ymd-h that the evaluation score does not include the discount factor. I think the reason the DDQN paper reports the discounted return is to evaluate the overestimation phenomenon: since the Q-network produces an estimate of the discounted cumulative reward, the “true” return it is compared against should also be computed with the discount factor. I don’t think other papers report discounted returns.
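
To make that overestimation check concrete, here is a minimal standalone sketch: roll out one evaluation episode, accumulate the discounted return, and compare it with the critic’s estimate at the first state-action pair. The `compute_q_value` helper and the plain Gym-style `env` interface are assumptions for illustration, not part of tf2rl’s Trainer API.

    # Hedged sketch: compare the critic's estimate Q(s0, a0) against the
    # empirical *discounted* return of the same evaluation rollout.
    # `policy.compute_q_value` is a hypothetical helper, not tf2rl's API.
    import numpy as np

    def overestimation_gap(policy, env, max_steps=1000):
        obs = env.reset()
        action = policy.get_action(obs, test=True)
        q_estimate = policy.compute_q_value(obs, action)  # hypothetical helper

        discounted_return = 0.0
        for t in range(max_steps):
            obs, reward, done, _ = env.step(action)
            discounted_return += reward * np.power(policy.discount, t)
            if done:
                break
            action = policy.get_action(obs, test=True)

        # A positive gap means the critic overestimates the discounted return.
        return q_estimate - discounted_return

Averaging this gap over evaluation episodes is one way to quantify the overestimation being discussed.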

0 reactions
ymd-h commented, Aug 9, 2021

@naji-s

Although the paper (maybe) doesn’t describe the definition, I think the plots show non-discounted rewards obtained from models trained with discounted rewards.

As long as the discount factor (gamma) is fixed (and n-step is fixed), you can use the discounted reward for model comparison, but it is not a universal metric. To improve model performance, we try to tune the discount factor, so the metric itself should be independent of the discount factor.
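
A toy calculation illustrates why the discounted return is not a universal metric (the 100-step, constant-reward episode below is just an assumption for illustration): the same behaviour scores very differently under different gammas, while the plain sum stays put.

    # Toy illustration: for the same reward sequence, the discounted return
    # changes with gamma, while the undiscounted sum does not.
    import numpy as np

    rewards = np.ones(100)  # hypothetical episode: reward 1 at every step

    for gamma in (0.9, 0.99, 0.999):
        discounted = np.sum(rewards * gamma ** np.arange(len(rewards)))
        print(f"gamma={gamma}: discounted={discounted:.2f}, undiscounted={rewards.sum():.0f}")

    # gamma=0.9  : discounted ~ 10.00, undiscounted = 100
    # gamma=0.99 : discounted ~ 63.40, undiscounted = 100
    # gamma=0.999: discounted ~ 95.21, undiscounted = 100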
