
Inference for TFMarianMTModel (en to Romance language translation) is slow and inaccurate

See original GitHub issue

System Info

macOS Monterey 12.2.1

transformers==4.20.1
tensorflow==2.9.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

from transformers import TFMarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-ROMANCE"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = TFMarianMTModel.from_pretrained(model_name)
text_in = ['>>fr<< hello']  # >>fr<< selects French as the target language
batch = tokenizer(text_in, return_tensors='tf', padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))

Output:

- Qu'est-ce qu'il y a, là-bas, là-bas, là---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Expected behavior

I would expect similar performance to the PyTorch model.

Inference takes about 120 s on my machine and produces an incorrect translation. In contrast, the PyTorch model (replacing TFMarianMTModel with MarianMTModel and changing return_tensors to 'pt' in the snippet above) returns the correct translation (“Bonjour”), and inference takes about 6 s on my machine.

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

2 reactions
gante commented, Aug 24, 2022

Hi there @ydshieh @danielenricocahall 👋

None of the Marian models can be successfully converted to TF – they all fail when validating the hidden layers and outputs of the models. This is a shame, since there are a ton of Marian models for translation 😦

This means there is something wrong with either the model architecture or the weight cross-loading. I haven’t looked into it beyond noticing the issue when attempting to convert the weights from Helsinki-NLP.

1 reaction
jamie0725 commented, Nov 3, 2022

@ydshieh Hi, I am experiencing the same issue. I expected the TF version to be faster than the PT version.
