Questions about Topk REINFORCE
Hello, thanks for sharing!
I have some questions about pi_beta_sample in models.py. You use this function in _select_action_with_TopK_correction, but it seems to sample only one item each time?
I am also confused by Equation 6 in the original paper: since we want to sample a set of top-K items, shouldn't it be [equation not preserved]? Here a_{t, i} represents the i-th item at time t.
I would appreciate any comments on my question, since it has been bothering me for a long time.
Issue Analytics
- State:
- Created 4 years ago
- Comments:5 (2 by maintainers)
Top GitHub Comments
In case someone else comes back to this at some point: I was wondering the same thing. I implemented it for the scenario where only one action per slate can (or will) be clicked anyway, so when feedback arrives we know which item it corresponds to.
I guess the authors did the same thing because this sounds like it:
with footnote 3 saying:
OK, I will keep watching this repository. Please let me know if you have any new thoughts, and thanks for sharing too.
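For anyone landing on this thread later, here is a minimal sketch of how a per-item update can still account for a slate of K items. It is not taken from the repository's models.py; every function name below is hypothetical. It assumes, as the paper does, that the slate is built from K independent draws from the policy with duplicates removed, so each logged item is weighted by the top-K multiplier K(1 - pi(a|s))^(K-1). The cap value of 10 is an arbitrary placeholder.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def slate_inclusion_prob(pi_a, k):
    """alpha(a|s) = 1 - (1 - pi(a|s))**K: probability that item a appears
    in a slate built from K independent draws from pi."""
    return 1.0 - (1.0 - pi_a) ** k


def topk_multiplier(pi_a, k):
    """lambda_K = d(alpha)/d(pi) = K * (1 - pi(a|s))**(K - 1)."""
    return k * (1.0 - pi_a) ** (k - 1)


def sample_slate(probs, k, rng):
    """Draw K items independently from pi and drop duplicates,
    so the slate may contain fewer than K distinct items."""
    draws = rng.choices(range(len(probs)), weights=probs, k=k)
    return sorted(set(draws))


def per_item_weight(pi_a, beta_a, k, reward, cap=10.0):
    """Scalar that multiplies grad log pi(a|s) for ONE logged item:
    capped importance ratio * top-K multiplier * reward.
    The cap value is an assumption, not a recommended setting."""
    importance = min(pi_a / beta_a, cap)
    return importance * topk_multiplier(pi_a, k) * reward


if __name__ == "__main__":
    rng = random.Random(0)
    pi = softmax([2.0, 1.0, 0.5, 0.0])
    print("slate:", sample_slate(pi, k=3, rng=rng))
    print("weight for item 0:", per_item_weight(pi[0], beta_a=0.25, k=3, reward=1.0))
```

With K = 1 the multiplier equals 1 and the weight reduces to ordinary off-policy REINFORCE. For larger K, items the policy already ranks highly get a smaller multiplier, because they are likely to appear somewhere in the slate anyway.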