
Windows: Can't find vocabulary file for MarianTokenizer


🐛 Bug

MarianTokenizer.from_pretrained() fails with Python 3.6.4 on Windows 10.

Information

Occurs when running the example here: https://huggingface.co/transformers/model_doc/marian.html?highlight=marianmtmodel#transformers.MarianMTModel

Model I am using (Bert, XLNet …): MarianMTModel

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

Paste the code from the example and run it:

from transformers import MarianTokenizer, MarianMTModel
from typing import List
src = 'fr'  # source language
trg = 'en'  # target language
sample_text = "où est l'arrêt de bus ?"
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'

model = MarianMTModel.from_pretrained(mname)
tok = MarianTokenizer.from_pretrained(mname)
batch = tok.prepare_translation_batch(src_texts=[sample_text])  # don't need tgt_text for inference
gen = model.generate(**batch)  # for forward pass: model(**batch)
words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns "Where is the the bus stop ?"
print(words)

Steps to reproduce the behavior:

  1. Run the example
  2. Program terminates on tok = MarianTokenizer.from_pretrained(mname) with the following output:
stdbuf was not found; communication with perl may hang due to stdio buffering.
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1055, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 89, in __init__
    self._setup_normalizer()
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 95, in _setup_normalizer
    self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
  File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
    super().__init__(argv)
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
    self.start()
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
    env=env,
  File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Development/Research/COVID-19-Misinfo2/src/translate_test_2.py", line 9, in <module>
    tok = MarianTokenizer.from_pretrained(mname)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 902, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1058, in _from_pretrained
    "Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
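
The chained traceback points at the real failure: before the vocabulary error, MarianTokenizer's _setup_normalizer() builds a MosesPunctuationNormalizer, which launches an external perl process through toolwrapper, and it is that subprocess launch that dies on Windows with WinError 2. A minimal diagnostic sketch (my own, not from the issue; the 'fr' language code mirrors the example above):

import shutil

from mosestokenizer import MosesPunctuationNormalizer

# toolwrapper spawns external tools; the traceback above shows the spawn
# failing with WinError 2, so first check they are reachable on PATH.
for tool in ("perl", "stdbuf"):
    print(f"{tool}: {shutil.which(tool) or 'NOT FOUND on PATH'}")

try:
    # Same call MarianTokenizer._setup_normalizer() makes, per the traceback.
    MosesPunctuationNormalizer("fr")
    print("MosesPunctuationNormalizer started OK")
except OSError as exc:
    print(f"MosesPunctuationNormalizer failed: {exc}")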

Expected behavior

Prints ["Where is the the bus stop ?"]

Environment info

  • transformers version: 2.9.1
  • Platform: Windows-10-10.0.18362-SP0
  • Python version: 3.6.4
  • PyTorch version (GPU?): 1.5.0+cu101 (True)
  • Tensorflow version (GPU?): 2.1.0 (True)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 17 (3 by maintainers)

Top GitHub Comments

1 reaction
pgfeldman commented, Jun 30, 2020

Just upgraded to version 3.0, and everything is working!
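
For anyone checking the same fix, a quick way to confirm which transformers build a script is actually importing before and after the upgrade (the 3.0 threshold comes from this comment, not from a changelog I have verified):

# Print the version and install location of the transformers package the
# script resolves; per this thread the error stops reproducing after 3.0
# (pip install --upgrade transformers).
import transformers

print(transformers.__version__)  # 2.9.1 reproduces the error above
print(transformers.__file__)     # helps when several Python installs coexist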

0 reactions
pgfeldman commented, Jun 17, 2020

Hi Sam,

I just rebased, verified the git log, and installed using "pip install --upgrade .". I'm attaching the console record of the install.

I still get the same error(s):

2020-06-17 05:40:43.980254: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
stdbuf was not found; communication with perl may hang due to stdio buffering.
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1161, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 81, in __init__
    self._setup_normalizer()
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 87, in _setup_normalizer
    self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
  File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
    super().__init__(argv)
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
    self.start()
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
    env=env,
  File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Phil/AppData/Roaming/JetBrains/IntelliJIdea2020.1/scratches/transformers_error.py", line 9, in <module>
    tok = MarianTokenizer.from_pretrained(mname)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1008, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1164, in _from_pretrained
    "Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

Process finished with exit code 1

Hope this helps

Phil


On 2020-06-16 09:50, Sam Shleifer wrote:

I think this bug may be fixed on master, but I can't verify because I don't have Windows. Could one person check and post their results? Remember to be up to date with master; your git log should contain: 3d495c61e Sam Shleifer: Fix marian tokenizer save pretrained (#5043) - (HEAD -> master, upstream/master) (2 minutes ago)
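
A small sketch, assuming a local clone of huggingface/transformers, for confirming that the fix commit mentioned above is already in your history before reinstalling with pip install --upgrade . (the clone path is a placeholder):

import subprocess

REPO = "path/to/transformers"  # placeholder: your local clone of huggingface/transformers

# Exit code 0 from `git merge-base --is-ancestor` means commit 3d495c61e is
# already reachable from HEAD; anything else means you still need to pull/rebase.
result = subprocess.run(
    ["git", "merge-base", "--is-ancestor", "3d495c61e", "HEAD"],
    cwd=REPO,
)
print("fix commit in history:", result.returncode == 0)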


