
error when trying to use multilingual model for fine tuning

See original GitHub issue

I wanted to fine-tune on Hindi-language data. For that I tried to pass the bert-base-multilingual model, but I am getting the following error:

python pregenerate_training_data.py --train_corpus=./hindi_pytorch_bert_data_1.txt --bert_model=bert-base-multilingual --output_dir=./hindi_train_data_1_3epochs/ --epochs_to_generate=3

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Model name 'bert-base-multilingual' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'bert-base-multilingual' was a path or url but couldn't find any file associated to this path or url.
Traceback (most recent call last):
  File "pregenerate_training_data.py", line 292, in <module>
    main()
  File "pregenerate_training_data.py", line 255, in main
    vocab_list = list(tokenizer.vocab.keys())
AttributeError: 'NoneType' object has no attribute 'vocab'
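
For context: in the pytorch-pretrained-bert releases this script targets, BertTokenizer.from_pretrained logs the "was not found in model name list" error shown above and then returns None instead of raising, so the script only crashes one step later when it touches tokenizer.vocab. A minimal sketch of loading the tokenizer with a valid multilingual name (the explicit None check is an illustration added here, not part of the original script):

from pytorch_pretrained_bert import BertTokenizer

# 'bert-base-multilingual' was never a released checkpoint; the valid
# multilingual names are the -uncased and -cased variants listed in the
# error message above.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
if tokenizer is None:
    raise ValueError("unknown model name or unreachable path/URL")
vocab_list = list(tokenizer.vocab.keys())
print(len(vocab_list))  # roughly 119k wordpieces for the cased multilingual vocab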

I tried giving bert-base-multilingual-cased as well; then I ran into this error:

python pregenerate_training_data.py --train_corpus=./hindi_pytorch_bert_data_1.txt --bert_model=bert-base-multilingual-cased --output_dir=./hindi_train_data_1_3epochs/ --epochs_to_generate=3

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
usage: pregenerate_training_data.py [-h] --train_corpus TRAIN_CORPUS
                                    --output_dir OUTPUT_DIR --bert_model
                                    {bert-base-uncased,bert-large-uncased,bert-base-cased,bert-base-multilingual,bert-base-chinese}
                                    [--do_lower_case] [--reduce_memory]
                                    [--epochs_to_generate EPOCHS_TO_GENERATE]
                                    [--max_seq_len MAX_SEQ_LEN]
                                    [--short_seq_prob SHORT_SEQ_PROB]
                                    [--masked_lm_prob MASKED_LM_PROB]
                                    [--max_predictions_per_seq MAX_PREDICTIONS_PER_SEQ]
pregenerate_training_data.py: error: argument --bert_model: invalid choice: 'bert-base-multilingual-cased' (choose from 'bert-base-uncased', 'bert-large-uncased', 'bert-base-cased', 'bert-base-multilingual', 'bert-base-chinese')

How can I resolve this issue?
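
The second failure has nothing to do with the model itself: the script's argparse whitelist for --bert_model is stale and predates the -uncased/-cased multilingual checkpoints, so even a correct name is rejected before the tokenizer is ever loaded. A minimal sketch of the fix, extending the choices list to the names the tokenizer actually accepts (the surrounding code in pregenerate_training_data.py is assumed here, not quoted):

import argparse

parser = argparse.ArgumentParser()
# Extend the stale whitelist so the released multilingual checkpoints
# pass argparse validation; dropping choices= entirely also works and
# additionally allows a local path to a downloaded checkpoint.
parser.add_argument("--bert_model", type=str, required=True,
                    choices=["bert-base-uncased", "bert-large-uncased",
                             "bert-base-cased", "bert-large-cased",
                             "bert-base-multilingual-uncased",
                             "bert-base-multilingual-cased",
                             "bert-base-chinese"])
args = parser.parse_args(["--bert_model", "bert-base-multilingual-cased"])
print(args.bert_model)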

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6

Top GitHub Comments

2 reactions
shirancohen2016 commented on May 12, 2019

Hi, I followed your code, and got this error:

| 6796/185072 [00:00<00:18, 9787.42it/s]
Traceback (most recent call last):
  File "pregenerate_training_data.py", line 308, in <module>
    main()
  File "pregenerate_training_data.py", line 293, in main
    vocab_list=vocab_list)
  File "pregenerate_training_data.py", line 208, in create_instances_from_document
    assert len(tokens_b) >= 1
AssertionError

Can you please share your code?
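
For context on the assertion above: create_instances_from_document asserts that the second segment of each sentence pair is non-empty, and that typically fails when the corpus is not in the layout the script expects, one sentence per line with a blank line between documents, so that a single-sentence document can never supply tokens_b. A sketch for sanity-checking the corpus before generating data (the file name follows the command earlier in the issue; the format expectation is an assumption about this script):

from pathlib import Path

# Split the corpus into documents on blank lines and flag documents
# with fewer than two sentences, which would leave tokens_b empty.
docs, current = [], []
for line in Path("hindi_pytorch_bert_data_1.txt").read_text(encoding="utf-8").splitlines():
    line = line.strip()
    if line:
        current.append(line)
    elif current:
        docs.append(current)
        current = []
if current:
    docs.append(current)

short = [i for i, d in enumerate(docs) if len(d) < 2]
print(f"{len(docs)} documents, {len(short)} with fewer than 2 sentences")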

0 reactions
stale[bot] commented on Sep 7, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
