
Questions about Topk REINFORCE


Hello, thanks for sharing! I have some questions about pi_beta_sample in models.py: you use this function in _select_action_with_TopK_correction, but it seems to sample only one item each time? I am also confused by Equation 6 in the original paper,

$$\sum_{\tau \sim \beta} \sum_{t} \frac{\pi_\theta(a_t \mid s_t)}{\beta(a_t \mid s_t)} \, \lambda_K(a_t \mid s_t) \, R_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t),$$

since we want to sample a set of top-K items, shouldn't it be

$$\sum_{\tau \sim \beta} \sum_{t} \sum_{i=1}^{K} \frac{\pi_\theta(a_{t,i} \mid s_t)}{\beta(a_{t,i} \mid s_t)} \, \lambda_K(a_{t,i} \mid s_t) \, R_t \, \nabla_\theta \log \pi_\theta(a_{t,i} \mid s_t),$$

where $a_{t,i}$ represents the $i$-th item at time $t$? I would appreciate any comments on my question, since it has been bothering me for a long time.
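For readers stuck on the same point: in the paper, the top-K correction is deliberately written per single sampled action, with the multiplier $\lambda_K(a \mid s) = K\,(1 - \pi_\theta(a \mid s))^{K-1}$ absorbing the fact that $a$ is one of $K$ draws. Below is a minimal sketch of that weight in PyTorch; the function names are invented for illustration and are not the repository's actual models.py API.

```python
import torch

def topk_off_policy_weight(pi_prob, beta_prob, K):
    """Per-item weight from 'Top-K Off-Policy Correction for a
    REINFORCE Recommender System': (pi / beta) * lambda_K.

    pi_prob   -- pi_theta(a|s), target policy probability of the item
    beta_prob -- beta(a|s), behavior (logging) policy probability
    K         -- slate size
    """
    importance = pi_prob / beta_prob            # first-order off-policy ratio
    lambda_k = K * (1.0 - pi_prob).pow(K - 1)   # d(alpha)/d(pi), alpha = 1 - (1 - pi)^K
    return importance * lambda_k

def sample_slate(logits, K):
    """Draw a K-item slate from pi_theta. The paper's derivation assumes
    K independent draws; sampling without replacement is a common
    practical stand-in."""
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=K, replacement=False)
```

Because the weight factorizes per item, the gradient can be accumulated one sampled action at a time, which is consistent with pi_beta_sample drawing a single item per call.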

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
almajo commented, Mar 7, 2020

In case someone else comes back to this at some point: I was wondering the same thing, and I implemented it for the scenario where only one action per slate can or will be clicked anyway; hence, when receiving feedback, we know which item that feedback corresponds to.

I guess the authors did the same thing, because this passage from the paper suggests it:

(2) While the main policy head π_θ is trained using only items on the trajectory with non-zero reward³, the behavior policy β_θ′ is trained using all of the items on the trajectory to avoid introducing bias in the β estimate.

with footnote 3 saying:

We ignore them in the user state update as users are unlikely to notice them, and as a result, we assume the user state is not influenced by these actions.
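In code, the split described in that quote might look roughly like the sketch below: the β head gets a gradient from every impressed item, while the π head only ever sees the clicked (non-zero-reward) item. All names are invented for illustration; this is not the repository's actual training loop.

```python
import torch
import torch.nn.functional as F

def step_losses(state, slate_items, rewards, pi_head, beta_head):
    """state       -- encoded user state (1-D tensor)
    slate_items -- LongTensor with indices of all items shown
    rewards     -- per-item rewards; at most one non-zero (the click)
    """
    # beta trains on *all* impressed items, so the estimate of the
    # logging policy stays unbiased.
    beta_log_probs = F.log_softmax(beta_head(state), dim=-1)
    beta_loss = -beta_log_probs[slate_items].mean()

    # pi trains only on items with non-zero reward, i.e. the clicked
    # item -- which is why a single action per slate suffices. (The
    # full update would also multiply in R_t and the top-K weight.)
    clicked = slate_items[rewards > 0]
    if clicked.numel() == 0:
        return beta_loss, None   # no click: only beta gets a gradient
    pi_log_probs = F.log_softmax(pi_head(state), dim=-1)
    pi_loss = -pi_log_probs[clicked].mean()
    return beta_loss, pi_loss
```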

0 reactions
wwwangzhch commented, Jan 9, 2020

OK, I will keep watching this repository. Please let me know if you have any new thoughts, and thanks for sharing too.


Top Results From Across the Web

  • "Top-K Off-Policy Correction for a REINFORCE Recommender ...": Reinforce is similar to Q-Learning. Basically you need to understand the difference between value and policy iteration: Policy iteration ...
  • "Top-K Off-Policy Correction for a REINFORCE Recommender ...": The new A.I., known as Reinforce [sic], was a kind of long-term addiction machine. It was designed to maximize users' engagement over time ...
  • "Question about the weight for correction in the importance sampling #7": According to the paper "Top-K Off-Policy Correction for a REINFORCE Recommender ...
  • "RL in RecSys, an overview" (Sergey Kolesnikov, Medium): These questions have led to the emergence of a new type of recommender ... Top-K Off-Policy Correction for a REINFORCE Recommender System.
