
Questions about Topk REINFORCE


Hello, thanks for sharing! I have some questions about pi_beta_sample in models.py: you use this function in _select_action_with_TopK_correction, but it seems to sample only one item each time? I am also confused by Equation 6 in the original paper,

$$\sum_{\tau \sim \beta} \sum_{t} \frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)} \, \lambda_K(a_t \mid s_t) \, R_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t),$$

since we want to sample a set of top-K items, shouldn't it be

$$\sum_{\tau \sim \beta} \sum_{t} \sum_{i=1}^{K} \frac{\pi_\theta(a_{t,i} \mid s_t)}{\beta(a_{t,i} \mid s_t)} \, \lambda_K(a_{t,i} \mid s_t) \, R_t \, \nabla_\theta \log \pi_\theta(a_{t,i} \mid s_t),$$

where $a_{t,i}$ represents the $i$-th item at time $t$? I would appreciate any comments on my question, since it has been bothering me for a long time.
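For readers stuck on the same point: in the paper, the top-K correction is deliberately written per single sampled action, with the multiplier $\lambda_K(a \mid s) = K\,(1 - \pi_\theta(a \mid s))^{K-1}$ absorbing the fact that $a$ is one of $K$ draws. Below is a minimal sketch of that weight in PyTorch; the function names are invented for illustration and are not the repository's actual models.py API.

```python
import torch

def topk_off_policy_weight(pi_prob, beta_prob, K):
    """Per-item weight from 'Top-K Off-Policy Correction for a
    REINFORCE Recommender System': (pi / beta) * lambda_K.

    pi_prob   -- pi_theta(a|s), target policy probability of the item
    beta_prob -- beta(a|s), behavior (logging) policy probability
    K         -- slate size
    """
    importance = pi_prob / beta_prob            # first-order off-policy ratio
    lambda_k = K * (1.0 - pi_prob).pow(K - 1)   # d(alpha)/d(pi), alpha = 1 - (1 - pi)^K
    return importance * lambda_k

def sample_slate(logits, K):
    """Draw a K-item slate from pi_theta. The paper's derivation assumes
    K independent draws; sampling without replacement is a common
    practical stand-in."""
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=K, replacement=False)
```

Because the weight factorizes per item, the gradient can be accumulated one sampled action at a time, which is consistent with pi_beta_sample drawing a single item per call.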

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
almajo commented, Mar 7, 2020

In case someone else comes back to this at some point: I was wondering the same thing, and I implemented it for the scenario where only one action per slate can or will be clicked anyway; hence, when receiving feedback, we know which item that feedback corresponds to.

I guess the authors did the same thing, because this passage from the paper suggests it:

(2) While the main policy head π_θ is trained using only items on the trajectory with non-zero reward³, the behavior policy β_θ′ is trained using all of the items on the trajectory to avoid introducing bias in the β estimate.

with footnote 3 saying:

We ignore them in the user state update as users are unlikely to notice them, and as a result, we assume the user state is not influenced by these actions.
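In code, the split described in that quote might look roughly like the sketch below: the β head gets a gradient from every impressed item, while the π head only ever sees the clicked (non-zero-reward) item. All names are invented for illustration; this is not the repository's actual training loop.

```python
import torch
import torch.nn.functional as F

def step_losses(state, slate_items, rewards, pi_head, beta_head):
    """state       -- encoded user state (1-D tensor)
    slate_items -- LongTensor with indices of all items shown
    rewards     -- per-item rewards; at most one non-zero (the click)
    """
    # beta trains on *all* impressed items, so the estimate of the
    # logging policy stays unbiased.
    beta_log_probs = F.log_softmax(beta_head(state), dim=-1)
    beta_loss = -beta_log_probs[slate_items].mean()

    # pi trains only on items with non-zero reward, i.e. the clicked
    # item -- which is why a single action per slate suffices. (The
    # full update would also multiply in R_t and the top-K weight.)
    clicked = slate_items[rewards > 0]
    if clicked.numel() == 0:
        return beta_loss, None   # no click: only beta gets a gradient
    pi_log_probs = F.log_softmax(pi_head(state), dim=-1)
    pi_loss = -pi_log_probs[clicked].mean()
    return beta_loss, pi_loss
```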

0 reactions
wwwangzhch commented, Jan 9, 2020

OK, I will keep watching this repository. Please let me know if you have any new thoughts, and thanks for sharing too.


Top Results From Across the Web

  • "Top-K Off-Policy Correction for a REINFORCE Recommender ...": Reinforce is similar to Q-Learning. Basically you need to understand the difference between value and policy iteration: Policy iteration ...
  • "Top-K Off-Policy Correction for a REINFORCE Recommender ...": The new A.I., known as Reinforce [sic], was a kind of long-term addiction machine. It was designed to maximize users' engagement over time ...
  • "Question about the weight for correction in the importance sampling #7": According to the paper "Top-K Off-Policy Correction for a REINFORCE Recommender ...
  • "RL in RecSys, an overview" (Sergey Kolesnikov, Medium): These questions have led to the emergence of a new type of recommender ... Top-K Off-Policy Correction for a REINFORCE Recommender System.
