Unable to reproduce Web30K results of GSF (Groupwise Ranking)
Hi,
We are working on a research paper related to groupwise ranking and intend to use GSF (Groupwise Scoring Function) as one of our main baseline methods. However, we have found it hard to reproduce the GSF results on the MSLR Web30K dataset.
We cannot reproduce the reported results for any of the GSF variants in the paper. To check whether we are doing anything wrong, we list below the parameters we use to reproduce GSF(1), a groupwise scoring function with group size = 1.
We use the example code tf_ranking_libsvm.py to train the GSF(1) model, with the same hyperparameters reported in the original paper:
data=Web30K-Fold1 (queries whose labels are all zero removed)
learning_rate=0.005
train_batch_size=128
num_train_steps=30000
act_function=ReLU
each_layer=(FullyConnected-BatchNorm-ReLU)
input_process=BatchNorm
hidden_layer_dims=[64, 32, 16]
num_features=136
list_size=200
group_size=1
loss=softmax_loss
dropout_rate=0.0
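For concreteness, here is a minimal tf.keras sketch of how we interpret the scoring tower above. This is our own reconstruction for illustration only, not the actual Estimator code in tf_ranking_libsvm.py:

```python
import tensorflow as tf

def make_gsf1_scoring_tower(num_features=136, hidden_dims=(64, 32, 16)):
    """Our reading of GSF(1): BatchNorm over the raw features, then three
    FullyConnected -> BatchNorm -> ReLU blocks, then a single linear score."""
    inputs = tf.keras.Input(shape=(num_features,))
    x = tf.keras.layers.BatchNormalization()(inputs)   # input_process=BatchNorm
    for dim in hidden_dims:                             # hidden_layer_dims=[64, 32, 16]
        x = tf.keras.layers.Dense(dim)(x)               # FullyConnected
        x = tf.keras.layers.BatchNormalization()(x)     # BatchNorm
        x = tf.keras.layers.ReLU()(x)                   # ReLU
    score = tf.keras.layers.Dense(1)(x)                 # one relevance score per document
    return tf.keras.Model(inputs, score)
```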
After 30000 training steps, the NDCG@5 on the test set is 39.56517, which is much lower than the 40.4 reported in the paper.
Are there any other parameters we might have overlooked? We are not certain about several hyperparameters that are not listed in the paper:
1. The paper says the reported numbers are averaged over 10 runs. Which model is used in each run for the final average: the best model found during the 30000 training steps, or the model at the end of the 30000 steps?
2. Did groupwise ranking use dropout? If yes, how was it applied?
3. Was learning-rate decay or warmup applied during training?
4. Was feature normalization using a global feature mean/variance (not batch normalization) applied before the features are fed into the scoring function?
5. What are the batch-normalization parameters used in GSF?
6. Since TF-Ranking uses padding to handle a variable number of documents per query, what list_size was used to produce the groupwise-ranking results, and do padded records influence the batch-normalization statistics?
We would be sincerely grateful for any help.
Thanks

Why didn’t I think to reset the LIST_SIZE during testing!?

Thanks for reaching out to us! I’ll try to answer your questions based on what I understand from the paper and from speaking with a few of the authors.
Q1. The best model based on performance on the validation set.
Q2. I don’t believe dropout was used for the Web30K experiments.
Q3. No learning-rate decay/warmup was applied.
Q4. Only batch normalization was used, between each layer as well as over the input layer.
Q5. Default values.
Q6. During training, list_size (as you correctly pointed out) is set to 200, but for evaluation it is set to the maximum list size (e.g., 1300).
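To make the Q6 point concrete, here is a rough, hypothetical sketch of what padding/truncating to list_size means; the helper name and the -1 padding label are our illustration, not the exact tf_ranking_libsvm.py code. With list_size=200 at evaluation time, queries with more than 200 documents get truncated and NDCG is computed on the truncated lists, so the evaluation list_size should cover the longest query:

```python
import numpy as np

def pad_or_truncate(doc_features, doc_labels, list_size, num_features=136):
    """Hypothetical helper: force each query's document list to exactly
    `list_size` rows, padding with zero features and label -1."""
    n = min(len(doc_labels), list_size)  # documents kept for this query
    features = np.zeros((list_size, num_features), dtype=np.float32)
    labels = np.full((list_size,), -1.0, dtype=np.float32)  # -1 marks padding
    features[:n] = np.asarray(doc_features, dtype=np.float32)[:n]
    labels[:n] = np.asarray(doc_labels, dtype=np.float32)[:n]
    return features, labels
```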
One thing you didn’t ask but may be relevant is that the paper uses Adagrad to optimize the ranking loss.
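As a minimal sketch of that detail, assuming the learning rate of 0.005 listed above (the example script actually configures its optimizer through the TF 1.x Estimator API rather than Keras):

```python
import tensorflow as tf

# Assumption: Adagrad with learning_rate=0.005 as in the hyperparameters above;
# this only illustrates the optimizer choice, it is not the original code.
optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.005)
```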