
Unsupervised learning - TSDAE

Hi, I used the TSDAE method to pretrain a BERT model on a corpus of sentences and got this error:

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

I then re-ran it with CUDA_LAUNCH_BLOCKING=1 python [YOUR_PROGRAM] to trace the error and got this:

RuntimeError: CUDA error: device-side assert triggered

Any help?
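
A quick, hedged first check (not from the thread, just a common cause of this pair of messages): a device-side assert during BERT training is often an out-of-range token id, for example when tokens were added to the tokenizer but the embedding matrix was never resized. Assuming the same HooshvareLab/bert-fa-base-uncased model that comes up later in the thread, comparing the tokenizer size with the embedding size is a cheap way to get a readable diagnosis instead of an asynchronous CUDA failure:

from sentence_transformers import models

# Same base model that is used later in this thread.
word_embedding_model = models.Transformer('HooshvareLab/bert-fa-base-uncased', max_seq_length=250)

# Ids the tokenizer can emit vs. rows in the input embedding matrix.
vocab_size = len(word_embedding_model.tokenizer)
emb_rows = word_embedding_model.auto_model.get_input_embeddings().num_embeddings
print('tokenizer vocab:', vocab_size, 'embedding rows:', emb_rows)

# If the tokenizer is larger, any sentence that uses the extra tokens can trigger the
# device-side assert; calling resize_token_embeddings() after add_tokens() fixes it.
assert vocab_size <= emb_rows, 'call auto_model.resize_token_embeddings(len(tokenizer))'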

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 21 (10 by maintainers)

Top GitHub Comments

1 reaction
ReySadeghi commented, May 31, 2021

@kwang2049 @nreimers hi, I ran the code snippet mentioned above to add 10k new tokens. After 1 epoch of training, when I want to use the saved model to vectorize sentences, I get this error: AssertionError: Non-consecutive added token ‘#نوید_افکاری’ found. Should have index 100005 but has index 100006 in saved vocabulary.

I tried the snippet (see the comment below) and it was OK, but actually I think the problem was due to some tokens that weren’t valid UTF-8; when I removed them, the problem was solved.
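
For reference, a minimal sketch of that clean-up (the file name new_tokens.txt is hypothetical, and word_embedding_model refers to the models.Transformer object from the snippet in the comment below): keep only tokens that decode cleanly as UTF-8 before calling add_tokens.

# Hypothetical token file: one candidate token per line, possibly with bad encodings.
new_tokens = []
with open('new_tokens.txt', 'rb') as f:
    for raw in f:
        try:
            token = raw.decode('utf-8').strip()
        except UnicodeDecodeError:
            continue  # skip tokens that are not valid UTF-8
        if token:
            new_tokens.append(token)

word_embedding_model.tokenizer.add_tokens(new_tokens)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))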

1 reaction
kwang2049 commented, May 25, 2021

@kwang2049 @nreimers hi, I ran the code snippet mentioned above to add 10k new tokens. After 1 epoch of training, when I want to use the saved model to vectorize sentences, I get this error:

AssertionError: Non-consecutive added token ‘#نوید_افکاری’ found. Should have index 100005 but has index 100006 in saved vocabulary.

Hi @ReySadeghi, I cannot reproduce it: I found that it loads the SBERT checkpoint with added tokens successfully. Before a more detailed conversation, could you please run the following check (to see whether the assertion error still occurs without TSDAE):

from sentence_transformers import SentenceTransformer
from sentence_transformers import models


model_name = 'HooshvareLab/bert-fa-base-uncased'
word_embedding_model = models.Transformer(model_name, max_seq_length=250)

existing_word = list(word_embedding_model.tokenizer.vocab.keys())[1000]
vocab = ['<new_word_1>', '<new_word_2>', '<سلامسلام>', existing_word, '<new_subword111>', '<new_subword222>']

print('Before:', word_embedding_model.auto_model.embeddings)
# Register the new tokens with the tokenizer and grow the embedding matrix to match.
word_embedding_model.tokenizer.add_tokens(vocab)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
print('Now:', word_embedding_model.auto_model.embeddings)

pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=False, pooling_mode_cls_token=True, pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_sentences = [
    'A sentence containing <new_word_1> and <new_word_2>.',
    'A sentence containing only <new_word_2>.',
    'A sentence containing <سلامسلام>',
    f'A sentence containing {existing_word}',
    'A sentence containing <new_subword111>xxx, my<new_subword222>yyyu',
]

model.save('sbert_tokens_added')
model = SentenceTransformer('sbert_tokens_added')
print([model[0].tokenizer.tokenize(sentence) for sentence in train_sentences])

If running this new snippet also reports the error, I think it might be related to your transformers version. And if it works well, you can change the vocab variable above to your new token list and try again.
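
Once the extended model saves and loads cleanly, the actual TSDAE pretraining follows the standard sentence-transformers recipe; a rough sketch on top of the snippet above (batch size, learning rate, and output path are illustrative, not from the thread):

from torch.utils.data import DataLoader
from sentence_transformers import datasets, losses

# The dataset wraps plain sentences and adds the noise (token deletion) that TSDAE denoises.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# The decoder is initialized from the same checkpoint and tied to the (resized) encoder.
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True,
)
model.save('output/tsdae-bert-fa')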
