
error when trying to use multilingual model for fine tuning

See original GitHub issue

I wanted to fine-tune on Hindi-language data. For that I tried to pass the bert-base-multilingual model, but I am getting the following error:

python pregenerate_training_data.py --train_corpus=./hindi_pytorch_bert_data_1.txt --bert_model=bert-base-multilingual --output_dir=./hindi_train_data_1_3epochs/ --epochs_to_generate=3

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Model name 'bert-base-multilingual' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'bert-base-multilingual' was a path or url but couldn't find any file associated to this path or url.
Traceback (most recent call last):
  File "pregenerate_training_data.py", line 292, in <module>
    main()
  File "pregenerate_training_data.py", line 255, in main
    vocab_list = list(tokenizer.vocab.keys())
AttributeError: 'NoneType' object has no attribute 'vocab'
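
For context: in the pytorch-pretrained-bert releases this script targets, BertTokenizer.from_pretrained logs the "was not found in model name list" error shown above and then returns None instead of raising, so the script only crashes one step later when it touches tokenizer.vocab. A minimal sketch of loading the tokenizer with a valid multilingual name (the explicit None check is an illustration added here, not part of the original script):

from pytorch_pretrained_bert import BertTokenizer

# 'bert-base-multilingual' was never a released checkpoint; the valid
# multilingual names are the -uncased and -cased variants listed in the
# error message above.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
if tokenizer is None:
    raise ValueError("unknown model name or unreachable path/URL")
vocab_list = list(tokenizer.vocab.keys())
print(len(vocab_list))  # roughly 119k wordpieces for the cased multilingual vocab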

I tried giving bert-base-multilingual-cased as well; then I ran into this error:

python pregenerate_training_data.py --train_corpus=./hindi_pytorch_bert_data_1.txt --bert_model=bert-base-multilingual-cased --output_dir=./hindi_train_data_1_3epochs/ --epochs_to_generate=3

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
usage: pregenerate_training_data.py [-h] --train_corpus TRAIN_CORPUS
                                    --output_dir OUTPUT_DIR --bert_model
                                    {bert-base-uncased,bert-large-uncased,bert-base-cased,bert-base-multilingual,bert-base-chinese}
                                    [--do_lower_case] [--reduce_memory]
                                    [--epochs_to_generate EPOCHS_TO_GENERATE]
                                    [--max_seq_len MAX_SEQ_LEN]
                                    [--short_seq_prob SHORT_SEQ_PROB]
                                    [--masked_lm_prob MASKED_LM_PROB]
                                    [--max_predictions_per_seq MAX_PREDICTIONS_PER_SEQ]
pregenerate_training_data.py: error: argument --bert_model: invalid choice: 'bert-base-multilingual-cased' (choose from 'bert-base-uncased', 'bert-large-uncased', 'bert-base-cased', 'bert-base-multilingual', 'bert-base-chinese')

How can I resolve this issue?
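
The second failure has nothing to do with the model itself: the script's argparse whitelist for --bert_model is stale and predates the -uncased/-cased multilingual checkpoints, so even a correct name is rejected before the tokenizer is ever loaded. A minimal sketch of the fix, extending the choices list to the names the tokenizer actually accepts (the surrounding code in pregenerate_training_data.py is assumed here, not quoted):

import argparse

parser = argparse.ArgumentParser()
# Extend the stale whitelist so the released multilingual checkpoints
# pass argparse validation; dropping choices= entirely also works and
# additionally allows a local path to a downloaded checkpoint.
parser.add_argument("--bert_model", type=str, required=True,
                    choices=["bert-base-uncased", "bert-large-uncased",
                             "bert-base-cased", "bert-large-cased",
                             "bert-base-multilingual-uncased",
                             "bert-base-multilingual-cased",
                             "bert-base-chinese"])
args = parser.parse_args(["--bert_model", "bert-base-multilingual-cased"])
print(args.bert_model)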

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6

Top GitHub Comments

2 reactions
shirancohen2016 commented on May 12, 2019

Hi, I followed your code, and got this error:

| 6796/185072 [00:00<00:18, 9787.42it/s]
Traceback (most recent call last):
  File "pregenerate_training_data.py", line 308, in <module>
    main()
  File "pregenerate_training_data.py", line 293, in main
    vocab_list=vocab_list)
  File "pregenerate_training_data.py", line 208, in create_instances_from_document
    assert len(tokens_b) >= 1
AssertionError

Can you please share your code?
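
For context on the assertion above: create_instances_from_document asserts that the second segment of each sentence pair is non-empty, and that typically fails when the corpus is not in the layout the script expects, one sentence per line with a blank line between documents, so that a single-sentence document can never supply tokens_b. A sketch for sanity-checking the corpus before generating data (the file name follows the command earlier in the issue; the format expectation is an assumption about this script):

from pathlib import Path

# Split the corpus into documents on blank lines and flag documents
# with fewer than two sentences, which would leave tokens_b empty.
docs, current = [], []
for line in Path("hindi_pytorch_bert_data_1.txt").read_text(encoding="utf-8").splitlines():
    line = line.strip()
    if line:
        current.append(line)
    elif current:
        docs.append(current)
        current = []
if current:
    docs.append(current)

short = [i for i, d in enumerate(docs) if len(d) < 2]
print(f"{len(docs)} documents, {len(short)} with fewer than 2 sentences")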

0 reactions
stale[bot] commented on Sep 7, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
