
Training distiluse with TSDAE

See original GitHub issue

I’d like to further train distiluse-base-multilingual-cased-v1 on a custom dataset using the provided example train_tsdae_from_file.py. I’ve been able to use it to train both bert-base-uncased and stsb-xlm-r-multilingual, and I’m actually getting good results with the latter. I would like to do the same with distiluse, as it gives me a better result with the pretrained model, and hopefully it will improve with TSDAE. But I’m getting the following error:

tf-docker /root > python scripts/train_tsdae_from_file.py data/job_text_59k/jobtitle_59k-test.txt 
2021-05-28 00:57:12.412477: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Read file: 2000it [00:00, 1297341.17it/s]
2021-05-28 00:57:14 - 1926 train sentences
Traceback (most recent call last):
  File "scripts/train_tsdae_from_file.py", line 59, in <module>
    word_embedding_model = models.Transformer(model_name)
  File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/models/Transformer.py", line 28, in __init__
    self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
  File "/usr/local/lib/python3.6/dist-packages/transformers/models/auto/auto_factory.py", line 381, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_utils.py", line 1103, in from_pretrained
    f"Error no file named {[WEIGHTS_NAME, TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME + '.index', FLAX_WEIGHTS_NAME]} found in "
OSError: Error no file named ['pytorch_model.bin', 'tf_model.h5', 'model.ckpt.index', 'flax_model.msgpack'] found in directory saved_models/distiluse-base-multilingual-cased-v1/ or `from_tf` and `from_flax` set to False.

Looking at the unzipped files, it seems the save format is different: the former two models have the PyTorch weights file in the same directory as the config, while distiluse seems to be composed of several modules. Is it possible to train distiluse with TSDAE?
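For context, the multi-module layout can be seen by loading the checkpoint with SentenceTransformer rather than models.Transformer; a minimal sketch, assuming the same saved_models/ directory as in the log above:

from sentence_transformers import SentenceTransformer

# SentenceTransformer reads the modules.json in the checkpoint directory, so it
# resolves the layout that models.Transformer cannot: Transformer -> Pooling -> Dense
model = SentenceTransformer('saved_models/distiluse-base-multilingual-cased-v1/')
for idx, module in enumerate(model):
    print(idx, module)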

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
eduardofv commented, May 28, 2021

Thank you guys, all comments are worth keeping in mind! The reason I wanted to start with distiluse is that it is already giving pretty good results (although a little worse than the original USE-M-Lv3, which I really don’t know how I could fine-tune). I think I will take your advice and run some tests; since I need a multilingual language model, I may start with some version of BERT. But I may also try your approach @kwang2049 and see what happens. Hope I can share something useful later!

1 reaction
kwang2049 commented, May 28, 2021

Hi @eduardofv, starting from PLM models like bert-base-uncased and xlm-roberta-base makes more sense than starting from SBERT models, which have already been fine-tuned on sentence-embedding tasks. Actually, in our own results, we found that bert-base-uncased->TSDAE->stsb/nli is usually much better than bert-base-uncased->stsb/nli->TSDAE.

As for your question: as @nreimers said, it could be kind of tricky to start from this checkpoint, because (1) DistilBERT in HuggingFace (HF) has not officially been extended to support an LM head, and (2) TSDAE builds the decoder by inspecting the encoder config (from HF), so the differing pooling sizes can be an issue.
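The size mismatch in (2) is easy to see by comparing the pooled sentence-embedding dimension with the underlying DistilBERT hidden size; a minimal sketch of that check, loading the model directly from the hub:

from sentence_transformers import SentenceTransformer

# distiluse-v1 ends with a Dense layer that projects 768 -> 512
model = SentenceTransformer('distiluse-base-multilingual-cased-v1')
print(model.get_sentence_embedding_dimension())  # 512, after the final Dense layer
print(model[0].get_word_embedding_dimension())   # 768, DistilBERT's hidden size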

To solve this, (1) one can first extend DistilBERT to support an LM head. I have personally done this in this Gist; one can download it and import the included modeling_distilbert.py file to get LM-head support. And (2) as for the size issue, one can either pop the Dense layer or add a new Dense layer (mapping 512 back to 768).

Putting it all together in code, it looks like this (also, thanks for @ScottishFold007’s hint: to build an SBERT model from an SBERT checkpoint, one needs to use SentenceTransformer('checkpoint-name') rather than models.Transformer('checkpoint-name')):

"""
This file loads sentences from a provided text file. It is expected that there is one sentence per line in that text file.

TSDAE will be trained on these sentences. Checkpoints are stored every 500 steps to the output folder.

Usage:
python train_tsdae_from_file.py path/to/sentences.txt

"""
from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, datasets, losses
import modeling_distilbert
import logging
import gzip
from torch.utils.data import DataLoader
from datetime import datetime
import sys
import tqdm

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

# Train Parameters
# model_name = 'bert-base-uncased'
model_name = 'distiluse-base-multilingual-cased-v1'
batch_size = 8

#Input file path (a text file, each line a sentence)
if len(sys.argv) < 2:
    print("Run this script with: python {} path/to/sentences.txt".format(sys.argv[0]))
    exit()

filepath = sys.argv[1]

# Save path to store our model
output_name = ''
if len(sys.argv) >= 3:
    output_name = "-"+sys.argv[2].replace(" ", "_").replace("/", "_").replace("\\", "_")

model_output_path = 'output/train_tsdae{}-{}'.format(output_name, datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))


################# Read the train corpus  #################
train_sentences = []
with gzip.open(filepath, 'rt', encoding='utf8') if filepath.endswith('.gz') else open(filepath, encoding='utf8') as fIn:
    for line in tqdm.tqdm(fIn, desc='Read file'):
        line = line.strip()
        if len(line) >= 10:
            train_sentences.append(line)


logging.info("{} train sentences".format(len(train_sentences)))

################# Initialize an SBERT model #################
model = SentenceTransformer(model_name)
# Pop the last module (the Dense layer mapping 768 -> 512) so the pooled
# 768-dim output matches the hidden size the TSDAE decoder is built for
del model[-1]
# word_embedding_model = models.Transformer(model_name)
# # Apply **cls** pooling to get one fixed sized sentence vector
# pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
# model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

################# Train and evaluate the model (it needs about 1 hour for one epoch of AskUbuntu) #################
# We wrap our training sentences in the DenoisingAutoEncoderDataset to add deletion noise on the fly
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)


logging.info("Start training")
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True,
    checkpoint_path=model_output_path,
    use_amp=False                #Set to True, if your GPU supports FP16 cores
)
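
Alternatively, instead of popping the Dense layer as in the script above, one could keep it and append a new Dense layer mapping 512 back to 768, as mentioned in option (2); a minimal sketch of that variant (the use of models.Dense here is an assumption, not part of the original comment):

from sentence_transformers import SentenceTransformer, models

# Keep distiluse's existing Dense(768 -> 512) module and map back to the
# 768-dim hidden size that the TSDAE decoder is built for
base = SentenceTransformer('distiluse-base-multilingual-cased-v1')
dense_back = models.Dense(in_features=512, out_features=768)
model = SentenceTransformer(modules=list(base) + [dense_back])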