[ALBERT] Tokenization crashes while trying to finetune classifier with TF Hub model
I’m trying to get ALBERT running locally with the following command line:
python -m albert.run_classifier_with_tfhub --task_name=MNLI --data_dir=./multinli_1.0 --albert_hub_module_handle=https://tfhub.dev/google/albert_large/1 --output_dir=./output --do_train=True
When the tokenizer is initialized from the TF Hub module, it crashes:
Traceback (most recent call last):
File "/Users/vladimirbugay/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 320, in <module>
tf.app.run()
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 187, in main
tokenizer = create_tokenizer_from_hub_module(FLAGS.albert_hub_module_handle)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 161, in create_tokenizer_from_hub_module
spm_model_file=FLAGS.spm_model_file)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/tokenization.py", line 247, in __init__
self.vocab = load_vocab(vocab_file)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/tokenization.py", line 201, in load_vocab
token = token.strip().split()[0]
IndexError: list index out of range
The issue is that one of the lines in the file is just a newline character ‘\n’: token.strip().split() then returns an empty list, so indexing it with [0] raises the IndexError. However, even if I modify the code to ignore such lines (sketched below), it still crashes later.
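Here is roughly what that modification to load_vocab in albert/tokenization.py looks like; this is a minimal sketch, with everything except the blank-line guard assumed from the traceback above, so the surrounding code is approximate:

import collections
import tensorflow as tf  # TF 1.14, matching the environment listed below

def load_vocab(vocab_file):
  """Loads a vocabulary file into an OrderedDict mapping token -> id."""
  vocab = collections.OrderedDict()
  index = 0
  with tf.gfile.GFile(vocab_file, "r") as reader:
    while True:
      token = reader.readline()
      if not token:
        break
      pieces = token.strip().split()
      if not pieces:  # line is just "\n": skip it instead of indexing [0]
        continue
      vocab[pieces[0]] = index
      index += 1
  return vocab

Even with blank lines skipped, the run then fails with: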
Traceback (most recent call last):
File "/Users/vladimirbugay/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 320, in <module>
tf.app.run()
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 187, in main
tokenizer = create_tokenizer_from_hub_module(FLAGS.albert_hub_module_handle)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 161, in create_tokenizer_from_hub_module
spm_model_file=FLAGS.spm_model_file)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/tokenization.py", line 249, in __init__
self.vocab = load_vocab(vocab_file)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/tokenization.py", line 198, in load_vocab
token = convert_to_unicode(reader.readline())
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 179, in readline
return self._prepare_value(self._read_buf.ReadLineAsString())
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 98, in _prepare_value
return compat.as_str_any(val)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 117, in as_str_any
return as_str(value)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 87, in as_text
return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 8: invalid start byte
I’m running the code on OS X Catalina, Anaconda, Python 3.6, with:
sentencepiece 0.1.83 pypi_0 pypi
tensorflow 1.14.0 mkl_py36h933f829_0
tensorflow-base 1.14.0 mkl_py36h655c25b_0
tensorflow-estimator 1.14.0 py_0
tensorflow-hub 0.6.0 pyhe1b5a44_0 conda-forge
It turned out that the file which the script tries to load as a vocabulary is in fact a saved SentencePiece model. A change to lines 159-161 of run_classifier_with_tfhub.py did the trick:
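A sketch of that change, reconstructed from the description below rather than copied from the committed diff (the tokenization_info plumbing is assumed from the traceback above; the essential edit is the final spm_model_file argument):

import tensorflow as tf  # TF 1.x
import tensorflow_hub as hub
from albert import tokenization

def create_tokenizer_from_hub_module(albert_hub_module_handle):
  """Builds a FullTokenizer from the hub module's tokenization info."""
  with tf.Graph().as_default():
    albert_module = hub.Module(albert_hub_module_handle)
    tokenization_info = albert_module(signature="tokenization_info",
                                      as_dict=True)
    with tf.Session() as sess:
      vocab_file, do_lower_case = sess.run(
          [tokenization_info["vocab_file"],
           tokenization_info["do_lower_case"]])
  # The file served as "vocab_file" is actually a SentencePiece model, so
  # pass it as spm_model_file as well. Previously this was
  # FLAGS.spm_model_file, which was unset here, so FullTokenizer fell back
  # to parsing the binary file as a text vocabulary and crashed.
  return tokenization.FullTokenizer(
      vocab_file=vocab_file, do_lower_case=do_lower_case,
      spm_model_file=vocab_file)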
vocab_file is ignored when spm_model_file is set, and since there is no way to pass a null vocab_file, pointing both arguments at the same file causes no issue.
I managed to get the training data loaded and preprocessed, but then training crashes at a later step with another error.
Same problem here. Ignoring the empty lines just leads to the second encoding error.