Unable to reproduce Web30K results of GSF (Groupwise Ranking)
Hi,
We are working on a research paper related to groupwise ranking and intend to use GSF (Groupwise Scoring Function) as one of our main baseline methods. However, we have found it hard to reproduce the GSF results on the MSLR Web30K dataset.
We cannot reproduce the reported results for any of the GSF variants in the paper. To check whether we are doing anything wrong, we list below the parameters we use to reproduce GSF(1), a groupwise scoring function with group size = 1.
We use the example code tf_ranking_libsvm.py to train the GSF(1) model, with the same hyperparameters reported in the original paper:
data=Web30K-Fold1 (queries whose labels are all zero removed)
learning_rate=0.005
train_batch_size=128
num_train_steps=30000
act_function=ReLU
each_layer=(FullyConnected-BatchNorm-ReLU)
input_process=BatchNorm
hidden_layer_dims=[64, 32, 16]
num_features=136
list_size=200
group_size=1
loss=softmax_loss
dropout_rate=0.0
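For concreteness, here is a minimal tf.keras sketch of how we interpret the scoring tower above. This is our own reconstruction for illustration only, not the actual Estimator code in tf_ranking_libsvm.py:

```python
import tensorflow as tf

def make_gsf1_scoring_tower(num_features=136, hidden_dims=(64, 32, 16)):
    """Our reading of GSF(1): BatchNorm over the raw features, then three
    FullyConnected -> BatchNorm -> ReLU blocks, then a single linear score."""
    inputs = tf.keras.Input(shape=(num_features,))
    x = tf.keras.layers.BatchNormalization()(inputs)   # input_process=BatchNorm
    for dim in hidden_dims:                             # hidden_layer_dims=[64, 32, 16]
        x = tf.keras.layers.Dense(dim)(x)               # FullyConnected
        x = tf.keras.layers.BatchNormalization()(x)     # BatchNorm
        x = tf.keras.layers.ReLU()(x)                   # ReLU
    score = tf.keras.layers.Dense(1)(x)                 # one relevance score per document
    return tf.keras.Model(inputs, score)
```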
After 30000 training steps, the NDCG@5 on the test set is 39.56517, which is much lower than the 40.4 reported in the paper.
Are there any other parameters we might have overlooked? We are not certain about several hyperparameters that are not listed in the paper:
1. The paper says the reported numbers are averaged over 10 runs. Which model is used in each run for the final average: the best model found during the 30000 training steps, or the model at the end of the 30000 steps?
2. Did groupwise ranking use dropout? If yes, how was it applied?
3. Was learning-rate decay or warmup applied during training?
4. Was feature normalization using a global feature mean/variance (not batch normalization) applied before the features are fed into the scoring function?
5. What are the batch-normalization parameters used in GSF?
6. Since TF-Ranking uses padding to handle a variable number of documents per query, what list_size was used to produce the groupwise-ranking results, and do padded records influence the batch-normalization statistics?
We would be sincerely grateful for any help.
Thanks

Why didn’t I think to reset the LIST_SIZE during testing!?

Thanks for reaching out to us! I’ll try to answer your questions based on what I understand from the paper and from speaking with a few of the authors.
Q1. The best model based on performance on the validation set.
Q2. I don’t believe dropout was used for the Web30K experiments.
Q3. No learning-rate decay/warmup was applied.
Q4. Only batch normalization was used, between each layer as well as over the input layer.
Q5. Default values.
Q6. During training, list_size (as you correctly pointed out) is set to 200, but for evaluation it is set to the maximum list size (e.g., 1300).
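To make the Q6 point concrete, here is a rough, hypothetical sketch of what padding/truncating to list_size means; the helper name and the -1 padding label are our illustration, not the exact tf_ranking_libsvm.py code. With list_size=200 at evaluation time, queries with more than 200 documents get truncated and NDCG is computed on the truncated lists, so the evaluation list_size should cover the longest query:

```python
import numpy as np

def pad_or_truncate(doc_features, doc_labels, list_size, num_features=136):
    """Hypothetical helper: force each query's document list to exactly
    `list_size` rows, padding with zero features and label -1."""
    n = min(len(doc_labels), list_size)  # documents kept for this query
    features = np.zeros((list_size, num_features), dtype=np.float32)
    labels = np.full((list_size,), -1.0, dtype=np.float32)  # -1 marks padding
    features[:n] = np.asarray(doc_features, dtype=np.float32)[:n]
    labels[:n] = np.asarray(doc_labels, dtype=np.float32)[:n]
    return features, labels
```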
One thing you didn’t ask but may be relevant is that the paper uses Adagrad to optimize the ranking loss.
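As a minimal sketch of that detail, assuming the learning rate of 0.005 listed above (the example script actually configures its optimizer through the TF 1.x Estimator API rather than Keras):

```python
import tensorflow as tf

# Assumption: Adagrad with learning_rate=0.005 as in the hyperparameters above;
# this only illustrates the optimizer choice, it is not the original code.
optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.005)
```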