Windows: Can't find vocabulary file for MarianTokenizer
🐛 Bug
MarianTokenizer.from_pretrained() fails in Python 3.6.4 on Windows 10
Information
Occurs when using the example here: https://huggingface.co/transformers/model_doc/marian.html?highlight=marianmtmodel#transformers.MarianMTModel
Model I am using (Bert, XLNet …): MarianMTModel
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- [x] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)
The task I am working on is:
- [ ] an official GLUE/SQUaD task: (give the name)
- [x] my own task or dataset: (give details below)
To reproduce
Paste code from example and run:
from transformers import MarianTokenizer, MarianMTModel
from typing import List
src = 'fr' # source language
trg = 'en' # target language
sample_text = "où est l'arrêt de bus ?"
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'
model = MarianMTModel.from_pretrained(mname)
tok = MarianTokenizer.from_pretrained(mname)
batch = tok.prepare_translation_batch(src_texts=[sample_text]) # don't need tgt_text for inference
gen = model.generate(**batch) # for forward pass: model(**batch)
words: List[str] = tok.batch_decode(gen, skip_special_tokens=True) # returns "Where is the the bus stop ?"
print(words)
Steps to reproduce the behavior:
- Run the example
- Program terminates on
tok = MarianTokenizer.from_pretrained(mname)
stdbuf was not found; communication with perl may hang due to stdio buffering.
Traceback (most recent call last):
File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1055, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 89, in __init__
self._setup_normalizer()
File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 95, in _setup_normalizer
self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
super().__init__(argv)
File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
self.start()
File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
env=env,
File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Development/Research/COVID-19-Misinfo2/src/translate_test_2.py", line 9, in <module>
tok = MarianTokenizer.from_pretrained(mname)
File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 902, in from_pretrained
return cls._from_pretrained(*inputs, **kwargs)
File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1058, in _from_pretrained
"Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
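The final OSError about the vocabulary is raised while the earlier FileNotFoundError is being handled, so the first traceback is the one that points at the actual failure. A minimal sketch of walking an exception chain back to its root cause follows; `load_tokenizer` is a hypothetical stand-in that only mimics the wrapping seen above and is not the transformers implementation.

```python
def root_cause(exc):
    """Follow __cause__/__context__ links back to the first exception raised."""
    seen = set()
    while True:
        nxt = exc.__cause__ or exc.__context__
        if nxt is None or id(nxt) in seen:
            return exc
        seen.add(id(exc))
        exc = nxt


def load_tokenizer():
    # Hypothetical stand-in: a loader that hides a lower-level failure
    # behind a generic "vocabulary" error, as in the traceback above.
    try:
        raise FileNotFoundError(2, "The system cannot find the file specified")
    except Exception:
        raise OSError("Unable to load vocabulary from file.")


try:
    load_tokenizer()
except OSError as err:
    cause = root_cause(err)
    print(type(cause).__name__, "-", cause.strerror)
```

Wrapping the real `MarianTokenizer.from_pretrained(mname)` call in the same `try`/`except` would report the underlying error type rather than the vocabulary message.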
Expected behavior
Prints ["Where is the the bus stop ?"]
Environment info
- transformers version: 2.9.1
- Platform: Windows-10-10.0.18362-SP0
- Python version: 3.6.4
- PyTorch version (GPU?): 1.5.0+cu101 (True)
- Tensorflow version (GPU?): 2.1.0 (True)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Issue Analytics
- State:
- Created 3 years ago
- Comments:17 (3 by maintainers)
Top GitHub Comments
Just upgraded to version 3.0, and everything is working!
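Since this comment reports that version 3.0 resolved the problem, it may help to confirm which version is actually installed before retrying. The sketch below compares dotted version strings; `parse_version` and `is_at_least` are illustrative helpers, not part of transformers.

```python
def parse_version(text):
    """Turn '2.9.1' into (2, 9, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed, minimum):
    """True if the installed version is the minimum version or newer."""
    return parse_version(installed) >= parse_version(minimum)


print(is_at_least("2.9.1", "3.0"))  # version from the original report
print(is_at_least("3.0.0", "3.0"))  # version from this comment
```

In practice the installed string would come from `transformers.__version__`.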
Hi Sam,
I just rebased, verified the git log, and installed using "pip install --upgrade ." I'm attaching the console record of the install.
I still get the same error(s):
2020-06-17 05:40:43.980254: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
stdbuf was not found; communication with perl may hang due to stdio buffering.
Traceback (most recent call last):
File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1161, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 81, in __init__
self._setup_normalizer()
File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 87, in _setup_normalizer
self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
super().__init__(argv)
File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
self.start()
File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
env=env,
File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/Phil/AppData/Roaming/JetBrains/IntelliJIdea2020.1/scratches/transformers_error.py", line 9, in <module>
tok = MarianTokenizer.from_pretrained(mname)
File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1008, in from_pretrained
return cls._from_pretrained(*inputs, **kwargs)
File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1164, in _from_pretrained
"Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
Process finished with exit code 1
Hope this helps
Phil
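Both tracebacks end inside subprocess with [WinError 2], right after a warning that mentions perl and stdbuf. That suggests (an assumption, not something confirmed in this thread) that the Moses punctuation normalizer tries to launch an external program that Windows cannot find on PATH. A small sketch for checking which of those executables are discoverable; `find_tools` is a hypothetical helper name.

```python
import shutil


def find_tools(names=("perl", "stdbuf")):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report what the current environment can see.
for name, path in find_tools().items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If `perl` shows as not found, that would be consistent with the FileNotFoundError above, though it does not by itself prove the cause.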