
Unsupervised learning - TSDAE

Hi, I used the TSDAE method to pretrain a BERT model on a corpus of sentences and got this error:

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

I then re-ran it with CUDA_LAUNCH_BLOCKING=1 python [YOUR_PROGRAM] to trace the error and got this:

RuntimeError: CUDA error: device-side assert triggered

Any help?
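
A quick, hedged first check (not from the thread, just a common cause of this pair of messages): a device-side assert during BERT training is often an out-of-range token id, for example when tokens were added to the tokenizer but the embedding matrix was never resized. Assuming the same HooshvareLab/bert-fa-base-uncased model that comes up later in the thread, comparing the tokenizer size with the embedding size is a cheap way to get a readable diagnosis instead of an asynchronous CUDA failure:

from sentence_transformers import models

# Same base model that is used later in this thread.
word_embedding_model = models.Transformer('HooshvareLab/bert-fa-base-uncased', max_seq_length=250)

# Ids the tokenizer can emit vs. rows in the input embedding matrix.
vocab_size = len(word_embedding_model.tokenizer)
emb_rows = word_embedding_model.auto_model.get_input_embeddings().num_embeddings
print('tokenizer vocab:', vocab_size, 'embedding rows:', emb_rows)

# If the tokenizer is larger, any sentence that uses the extra tokens can trigger the
# device-side assert; calling resize_token_embeddings() after add_tokens() fixes it.
assert vocab_size <= emb_rows, 'call auto_model.resize_token_embeddings(len(tokenizer))'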

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 21 (10 by maintainers)

Top GitHub Comments

1 reaction
ReySadeghi commented, May 31, 2021

@kwang2049 @nreimers hi, I ran the code snippet mentioned above to add 10k new tokens. After 1 epoch of training, when I want to use the saved model to vectorize sentences, I get this error: AssertionError: Non-consecutive added token ‘#نوید_افکاری’ found. Should have index 100005 but has index 100006 in saved vocabulary.

I tried the snippet (see the comment below) and it was OK, but actually I think the problem was due to some tokens that weren’t valid UTF-8; when I removed them, the problem was solved.
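
For reference, a minimal sketch of that clean-up (the file name new_tokens.txt is hypothetical, and word_embedding_model refers to the models.Transformer object from the snippet in the comment below): keep only tokens that decode cleanly as UTF-8 before calling add_tokens.

# Hypothetical token file: one candidate token per line, possibly with bad encodings.
new_tokens = []
with open('new_tokens.txt', 'rb') as f:
    for raw in f:
        try:
            token = raw.decode('utf-8').strip()
        except UnicodeDecodeError:
            continue  # skip tokens that are not valid UTF-8
        if token:
            new_tokens.append(token)

word_embedding_model.tokenizer.add_tokens(new_tokens)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))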

1 reaction
kwang2049 commented, May 25, 2021

@kwang2049 @nreimers hi, I ran the code snippet mentioned above to add 10k new tokens. After 1 epoch of training, when I want to use the saved model to vectorize sentences, I get this error:

AssertionError: Non-consecutive added token ‘#نوید_افکاری’ found. Should have index 100005 but has index 100006 in saved vocabulary.

Hi @ReySadeghi, I cannot reproduce it: I found that it loads the SBERT checkpoint with added tokens successfully. Before a more detailed conversation, could you please run the following check (to see whether the assertion error still occurs without TSDAE):

from sentence_transformers import SentenceTransformer
from sentence_transformers import models


model_name = 'HooshvareLab/bert-fa-base-uncased'
word_embedding_model = models.Transformer(model_name, max_seq_length=250)

existing_word = list(word_embedding_model.tokenizer.vocab.keys())[1000]
vocab = ['<new_word_1>', '<new_word_2>', '<سلامسلام>', existing_word, '<new_subword111>', '<new_subword222>']

print('Before:', word_embedding_model.auto_model.embeddings)
# Register the new tokens with the tokenizer and grow the embedding matrix to match.
word_embedding_model.tokenizer.add_tokens(vocab)
word_embedding_model.auto_model.resize_token_embeddings(len(word_embedding_model.tokenizer))
print('Now:', word_embedding_model.auto_model.embeddings)

pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=False, pooling_mode_cls_token=True, pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_sentences = [
    'A sentence containing <new_word_1> and <new_word_2>.',
    'A sentence containing only <new_word_2>.',
    'A sentence containing <سلامسلام>',
    f'A sentence containing {existing_word}',
    'A sentence containing <new_subword111>xxx, my<new_subword222>yyyu',
]

model.save('sbert_tokens_added')
model = SentenceTransformer('sbert_tokens_added')
print([model[0].tokenizer.tokenize(sentence) for sentence in train_sentences])

If running this new snippet also reports the error, I think it might be related to your transformers version. And if it works well, you can change the vocab variable above to your new token list and try again.
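
Once the extended model saves and loads cleanly, the actual TSDAE pretraining follows the standard sentence-transformers recipe; a rough sketch on top of the snippet above (batch size, learning rate, and output path are illustrative, not from the thread):

from torch.utils.data import DataLoader
from sentence_transformers import datasets, losses

# The dataset wraps plain sentences and adds the noise (token deletion) that TSDAE denoises.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# The decoder is initialized from the same checkpoint and tied to the (resized) encoder.
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True,
)
model.save('output/tsdae-bert-fa')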
