
reproducing the paper's best results


I’ve tried to replicate the paper. For bert-base-nli-mean-tokens, a model trained from scratch with your code reached 74.71 cosine-similarity on the sts-test set, which is far lower than the score reported in the paper. Any thoughts?
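
For reference, the setup in question looks roughly like this with the current sentence-transformers API. This is only a sketch with stubbed data: the API of the version discussed in this thread may differ, and a real run trains on the full SNLI + MultiNLI pairs and evaluates on the STS benchmark.

# Sketch of the bert-base-nli-mean-tokens recipe with the current
# sentence-transformers API; data loading is stubbed with placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

word_embedding_model = models.Transformer('bert-base-uncased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Placeholder data: real runs use SNLI + MultiNLI pairs and STS scores normalized to [0, 1]
nli_examples = [InputExample(texts=['a premise', 'a hypothesis'], label=0)]
sts_examples = [InputExample(texts=['a sentence', 'another sentence'], label=0.8)]

train_dataloader = DataLoader(nli_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(model=model,
                                sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
                                num_labels=3)
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(sts_examples, name='sts-test')

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
print(model.evaluate(evaluator))  # cosine-similarity correlation on the STS examples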

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
nreimers commented, Dec 4, 2019

Hi, the problem appeared when I updated huggingface pytorch-transformers to version 1.x (which came with version 2.0 of sentence-transformers): performance dropped, even though the setup was the same as before.

I did extensive debugging, even copying old code from huggingface, but sadly never found a way to fix it. Interestingly, when loading weights that were trained with the old huggingface code, the same performance was still achieved. So something must have changed in the training procedure of the huggingface code that leads to this inferior performance with version 1 of pytorch-transformers. Maybe the optimizer code is a bit different?
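
One documented difference between the optimizer implementations: the old BertAdam never applied Adam's bias correction, whereas the AdamW that replaced it does by default. Whether that explains the drop here is unconfirmed; the following is only a sketch of how the old behaviour can be approximated with the newer transformers API (the model and schedule lengths are placeholders):

# Sketch only: approximating the old BertAdam behaviour with AdamW.
# The model and step counts below are placeholders, not values from this repository.
import torch
from transformers import AdamW, get_linear_schedule_with_warmup

model = torch.nn.Linear(10, 10)           # placeholder for the BERT model being fine-tuned
warmup_steps, total_steps = 100, 1000     # placeholder schedule lengths

optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)  # BertAdam skipped bias correction
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)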

I was not the only person affected by this; several people mentioned in the huggingface repo (see https://github.com/huggingface/transformers/issues/938) that they now achieve slightly worse performance. The reason is unclear.

I will soon be able to update pytorch-transformers to version 2. Maybe the issue is resolved in that version? Who knows.

If you would like to reproduce the old STS experiment scores, I recommend using an older version of this repository, one that uses the 0.x version of pytorch-transformers.

Best regards, Nils Reimers

0 reactions
nreimers commented, Dec 17, 2019

Hi @K-Mike, in the paper I used bert-as-a-service with mean pooling. Here is the code I used:

from __future__ import absolute_import, division, unicode_literals

import sys
import io
import numpy as np
import logging

import os
from bert_serving.client import BertClient

# Set PATHs
PATH_TO_SENTEVAL = '../'
PATH_TO_DATA = '../data'

# import SentEval
sys.path.insert(0, PATH_TO_SENTEVAL)
import senteval

# SentEval prepare and batcher
def prepare(params, samples):
    # No task-specific preparation needed; embeddings come from the BERT server
    pass

# Requires a running bert-serving-start server on localhost
bc = BertClient(ip='localhost', check_length=False)

def batcher(params, batch):
    # bert-as-a-service expects raw sentence strings, so rejoin SentEval's tokenized samples
    sentences = []
    for sample in batch:
        untoken = ' '.join(sample).lower()
        if untoken == '':
            untoken = '-'  # avoid sending an empty string to the server

        sentences.append(untoken)
    return bc.encode(sentences)


# Set params for SentEval
#params_senteval = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
#params_senteval['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128, 'tenacity': 3, 'epoch_size': 2}

# Parameters suggested by Readme & https://github.com/facebookresearch/SentEval/issues/43
params_senteval = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params_senteval['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64, 'tenacity': 5, 'epoch_size': 4}

# Set up logger
logging.basicConfig(format='%(asctime)s : %(message)s', level=logging.DEBUG)

if __name__ == "__main__":
    se = senteval.engine.SE(params_senteval, batcher, prepare)
    transfer_tasks = ['MR', 'CR', 'SUBJ', 'MPQA', 'SICKEntailment', 'SST2', 'TREC', 'MRPC']
    results = se.eval(transfer_tasks)
    print(results)

As I learned later (it was pointed out in one of the issues here), bert-as-a-service interprets mean pooling a bit differently:

In the default strategy REDUCE_MEAN, I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling.

This might be the cause of the differences. Perhaps taking only the last layer and performing mean pooling is better than the REDUCE_MEAN pooling from bert-as-a-service? It would be interesting to see which one is better.
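
To make the distinction concrete, here is a small sketch (not the code used in the paper) that computes both variants with the huggingface transformers API; the model name and tokenization details are illustrative assumptions.

# Sketch only: mean pooling over the last hidden layer (as in sentence-transformers)
# vs. the second-to-last layer (what bert-as-a-service's REDUCE_MEAN uses).
# Assumes a recent transformers version and bert-base-uncased as a stand-in model.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

def mean_pool(hidden_states, attention_mask):
    # Average the token vectors, ignoring padding positions
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

encoded = tokenizer(['a sentence to embed'], return_tensors='pt', padding=True)
with torch.no_grad():
    output = model(**encoded)

last_layer = output.hidden_states[-1]        # last transformer layer
second_to_last = output.hidden_states[-2]    # layer pooled by REDUCE_MEAN

emb_last = mean_pool(last_layer, encoded['attention_mask'])
emb_second_to_last = mean_pool(second_to_last, encoded['attention_mask'])

Evaluating both embeddings on the STS data would show how much the choice of layer actually matters.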

Best, Nils Reimers
