
Windows: Can't find vocabulary file for MarianTokenizer


🐛 Bug

MarianTokenizer.from_pretrained() fails with Python 3.6.4 on Windows 10.

Information

Occurs when running the example here: https://huggingface.co/transformers/model_doc/marian.html?highlight=marianmtmodel#transformers.MarianMTModel

Model I am using (Bert, XLNet …): MarianMTModel

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

Paste the code from the example and run it:

from transformers import MarianTokenizer, MarianMTModel
from typing import List
src = 'fr'  # source language
trg = 'en'  # target language
sample_text = "où est l'arrêt de bus ?"
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'

model = MarianMTModel.from_pretrained(mname)
tok = MarianTokenizer.from_pretrained(mname)
batch = tok.prepare_translation_batch(src_texts=[sample_text])  # don't need tgt_text for inference
gen = model.generate(**batch)  # for forward pass: model(**batch)
words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns "Where is the the bus stop ?"
print(words)

Steps to reproduce the behavior:

  1. Run the example
  2. Program terminates on tok = MarianTokenizer.from_pretrained(mname) with the following output:
stdbuf was not found; communication with perl may hang due to stdio buffering.
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1055, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 89, in __init__
    self._setup_normalizer()
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 95, in _setup_normalizer
    self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
  File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
    super().__init__(argv)
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
    self.start()
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
    env=env,
  File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Development/Research/COVID-19-Misinfo2/src/translate_test_2.py", line 9, in <module>
    tok = MarianTokenizer.from_pretrained(mname)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 902, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1058, in _from_pretrained
    "Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
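
The chained traceback points at the real failure: before the vocabulary error, MarianTokenizer's _setup_normalizer() builds a MosesPunctuationNormalizer, which launches an external perl process through toolwrapper, and it is that subprocess launch that dies on Windows with WinError 2. A minimal diagnostic sketch (my own, not from the issue; the 'fr' language code mirrors the example above):

import shutil

from mosestokenizer import MosesPunctuationNormalizer

# toolwrapper spawns external tools; the traceback above shows the spawn
# failing with WinError 2, so first check they are reachable on PATH.
for tool in ("perl", "stdbuf"):
    print(f"{tool}: {shutil.which(tool) or 'NOT FOUND on PATH'}")

try:
    # Same call MarianTokenizer._setup_normalizer() makes, per the traceback.
    MosesPunctuationNormalizer("fr")
    print("MosesPunctuationNormalizer started OK")
except OSError as exc:
    print(f"MosesPunctuationNormalizer failed: {exc}")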

Expected behavior

Prints ["Where is the the bus stop ?"]

Environment info

  • transformers version: 2.9.1
  • Platform: Windows-10-10.0.18362-SP0
  • Python version: 3.6.4
  • PyTorch version (GPU?): 1.5.0+cu101 (True)
  • Tensorflow version (GPU?): 2.1.0 (True)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 17 (3 by maintainers)

Top GitHub Comments

1 reaction
pgfeldman commented, Jun 30, 2020

Just upgraded to version 3.0, and everything is working!
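
For anyone checking the same fix, a quick way to confirm which transformers build a script is actually importing before and after the upgrade (the 3.0 threshold comes from this comment, not from a changelog I have verified):

# Print the version and install location of the transformers package the
# script resolves; per this thread the error stops reproducing after 3.0
# (pip install --upgrade transformers).
import transformers

print(transformers.__version__)  # 2.9.1 reproduces the error above
print(transformers.__file__)     # helps when several Python installs coexist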

0 reactions
pgfeldman commented, Jun 17, 2020

Hi Sam,

I just rebased, verified the git log, and installed using "pip install --upgrade .". I'm attaching the console record of the install.

I still get the same error(s):

2020-06-17 05:40:43.980254: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
stdbuf was not found; communication with perl may hang due to stdio buffering.
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1161, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 81, in __init__
    self._setup_normalizer()
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 87, in _setup_normalizer
    self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
  File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
    super().__init__(argv)
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
    self.start()
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
    env=env,
  File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/Phil/AppData/Roaming/JetBrains/IntelliJIdea2020.1/scratches/transformers_error.py", line 9, in <module>
    tok = MarianTokenizer.from_pretrained(mname)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1008, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils_base.py", line 1164, in _from_pretrained
    "Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

Process finished with exit code 1

Hope this helps

Phil


On 2020-06-16 09:50, Sam Shleifer wrote:

I think this bug may be fixed on master, but I can't verify because I don't have Windows. Could one person check and post their results? Remember to be up to date with master; your git log should contain: 3d495c61e Sam Shleifer: Fix marian tokenizer save pretrained (#5043) - (HEAD -> master, upstream/master) (2 minutes ago)
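
A small sketch, assuming a local clone of huggingface/transformers, for confirming that the fix commit mentioned above is already in your history before reinstalling with pip install --upgrade . (the clone path is a placeholder):

import subprocess

REPO = "path/to/transformers"  # placeholder: your local clone of huggingface/transformers

# Exit code 0 from `git merge-base --is-ancestor` means commit 3d495c61e is
# already reachable from HEAD; anything else means you still need to pull/rebase.
result = subprocess.run(
    ["git", "merge-base", "--is-ancestor", "3d495c61e", "HEAD"],
    cwd=REPO,
)
print("fix commit in history:", result.returncode == 0)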


