
Unable to reproduce Web30K results of GSF (Groupwise Ranking)


Hi,

We are working on a research paper related to Groupwise Ranking, and we intend to use it as one of our main baseline methods. However, we have found it hard to reproduce the GSF results on the MSLR Web30K dataset.

We cannot reproduce the reported results for any of the GSF variants. To clarify whether we are doing anything wrong, we list below the parameters we used to reproduce GSF(1), a groupwise ranking function with group size = 1.

We use the example code tf_ranking_libsvm.py to train the GSF(1) model, with the same hyperparameters reported in the original paper:

data=Web30K-Fold1 (queries with all-zero labels removed)
learning_rate=0.005
train_batch_size=128
num_train_steps=30000
act_function=ReLU
each_layer=(FullConnect-BatchNorm-ReLU)
input_process=BatchNorm
hidden_layer_dims=[64, 32, 16]
num_features=136
list_size=200
group_size=1
loss=softmax_loss
dropout_rate=0.0
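
For reference, here is a minimal sketch of how we build the scoring function from these hyperparameters, assuming the score_fn signature that tfr.model.make_groupwise_ranking_fn expects (TF 1.x-style layers; the helper name make_score_fn is ours, not from the library):

    import tensorflow as tf

    def make_score_fn(hidden_layer_dims=(64, 32, 16)):
      """Sketch of the GSF(1) scoring tower: BatchNorm on the input,
      then FullConnect -> BatchNorm -> ReLU for each hidden layer."""
      def _score_fn(context_features, group_features, mode, params, config):
        del context_features, params, config  # unused in this sketch
        is_training = (mode == tf.estimator.ModeKeys.TRAIN)
        # Concatenate the per-document features of the group (group_size=1).
        input_layer = tf.concat(
            [tf.compat.v1.layers.flatten(t) for t in group_features.values()],
            axis=1)
        cur = tf.compat.v1.layers.batch_normalization(
            input_layer, training=is_training)  # input_process=BatchNorm
        for dim in hidden_layer_dims:
          cur = tf.compat.v1.layers.dense(cur, units=dim)  # FullConnect
          cur = tf.compat.v1.layers.batch_normalization(
              cur, training=is_training)                   # BatchNorm
          cur = tf.nn.relu(cur)                            # ReLU
        return tf.compat.v1.layers.dense(cur, units=1)     # one score per group
      return _score_fn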

After 30,000 training steps, the NDCG@5 on the test set is 39.56517, which is much lower than the 40.4 reported in the paper.

Are there any other parameters we may have overlooked? We are not certain about some hyperparameters not listed in the paper:

  • The paper says the reported numbers are averaged over 10 runs. Which model is evaluated in each run for the final average: the best model seen during the 30,000 training steps, or the model at the final step?

  • Did groupwise ranking use dropout? If yes, how was it applied?

  • Was learning rate decay/warmup applied during training?

  • Was feature normalization using the global feature mean/variance (not BN) applied before features were fed into the scoring function? (One possible implementation is sketched after this list.)

  • What are the BN parameters used in GSF?

  • Since TF-Ranking uses padding to handle the variable number of documents per query, what list_size was used to produce the groupwise ranking results, and do padded records influence the batch normalization computation?
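
For the global-normalization question above, this is a minimal sketch of what we mean, assuming plain NumPy arrays of per-document features (the file names and array names are ours, purely illustrative): the statistics are computed once over the training set and then applied identically to every split.

    import numpy as np

    # Hypothetical feature matrices of shape [num_docs, num_features].
    train_features = np.loadtxt("train_features.txt")
    test_features = np.loadtxt("test_features.txt")

    # Global statistics, computed over the training set only.
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0) + 1e-6  # avoid division by zero

    # Applied to every split before the features reach the scoring function.
    train_features = (train_features - mean) / std
    test_features = (test_features - mean) / std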

We would be sincerely grateful for any support.

Thanks

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

eggie5 commented, Dec 10, 2019

Why didn’t I think to reset the LIST_SIZE during testing!?

sbruch commented, Dec 10, 2019

Thanks for reaching out to us! I’ll try to answer your questions based on what I understand from the paper and from speaking with a few of the authors.

  • Q1. The best model based on performance on the validation set.
  • Q2. I don’t believe it used dropout for the Web30K experiments.
  • Q3. No learning rate decay/warmup was applied.
  • Q4. Only batch normalization was used, between each layer as well as over the input layer.
  • Q5. Default values.
  • Q6. During training, list_size (as you correctly pointed out) is set to 200, but for evaluation it is set to the maximum list size (e.g., 1300).
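
To make the Q6 answer concrete, here is a minimal sketch of reading the data with a different list_size for training and for evaluation. It assumes the tfr.data.libsvm_generator helper used by tf_ranking_libsvm.py in that era of TF-Ranking, and the file paths are placeholders:

    import tensorflow_ranking as tfr

    NUM_FEATURES = 136
    TRAIN_LIST_SIZE = 200    # lists truncated/padded to 200 during training
    EVAL_LIST_SIZE = 1300    # roughly the maximum list size in Web30K

    # Paths are placeholders; substitute your local Fold1 files.
    train_gen = tfr.data.libsvm_generator("train.txt", NUM_FEATURES, TRAIN_LIST_SIZE)
    eval_gen = tfr.data.libsvm_generator("test.txt", NUM_FEATURES, EVAL_LIST_SIZE)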

One thing you didn’t ask but may be relevant is that the paper uses Adagrad to optimize the ranking loss.
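
Along the same lines, here is a minimal sketch of a train-op function using Adagrad with a fixed learning rate (no decay/warmup, per Q3), assuming the TF 1.x estimator API; make_train_op_fn is a hypothetical helper, and the UPDATE_OPS grouping is included because batch normalization needs its moving statistics updated alongside the optimizer step:

    import tensorflow as tf

    def make_train_op_fn(learning_rate=0.005):
      """Sketch: Adagrad with a constant learning rate, plus BN updates."""
      def _train_op_fn(loss):
        optimizer = tf.compat.v1.train.AdagradOptimizer(
            learning_rate=learning_rate)
        minimize_op = optimizer.minimize(
            loss, global_step=tf.compat.v1.train.get_global_step())
        # Run batch-norm moving-statistics updates with the minimize step,
        # so evaluation sees up-to-date statistics.
        update_ops = tf.compat.v1.get_collection(
            tf.compat.v1.GraphKeys.UPDATE_OPS)
        return tf.group(minimize_op, *update_ops)
      return _train_op_fn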
