Add Support for "No Language Left Behind" (NLLB)
Model description
Hi,
Meta recently released another cool project called “No Language Left Behind” (NLLB):
No Language Left Behind (NLLB) is a first-of-its-kind, AI breakthrough project that open-sources models capable of delivering high-quality translations directly between any pair of 200+ languages — including low-resource languages like Asturian, Luganda, Urdu and more. It aims to help people communicate with anyone, anywhere, regardless of their language preferences.
The project itself is integrated into the fairseq library and available on the nllb branch: https://github.com/facebookresearch/fairseq/tree/nllb
The release includes both the code and the trained checkpoints.
A detailed 190-page paper is also available.
We should add support for this amazing project by integrating NLLB.
Open source status
- [x] The model implementation is available
- [x] The model weights are available
Provide useful links for the implementation
Model checkpoints are available here:
| Model Name | Model Type | #params | Checkpoint | Metrics |
|---|---|---|---|---|
| NLLB-200 | MoE | 54.5B | model | metrics |
| NLLB-200 | Dense | 3.3B | model | metrics |
| NLLB-200 | Dense | 1.3B | model | metrics |
| NLLB-200-Distilled | Dense | 1.3B | model | metrics |
| NLLB-200-Distilled | Dense | 600M | model | metrics |
Maintainers are: @vedanuj, @shruti-bh, @annasun28, @elbayadm, @jeanm, @jhcross, @kauterry and @huihuifan.
The implementation is available in the fairseq repo: https://github.com/facebookresearch/fairseq/tree/nllb
Top GitHub Comments
Hi, I’m one of the Meta engineers who worked on NLLB, and I’m happy to support this from our side. That’s indeed the correct (real) SPM model for the vocabulary used for input/output, but internally the model’s vocabulary (and embedding table) is extended at the end with one token per language, which happens here:
https://github.com/facebookresearch/fairseq/blob/26d62ae8fbf3deccf01a138d704be1e5c346ca9a/fairseq/data/multilingual/multilingual_utils.py#L64
This list of languages comes from an input argument, which reads them from a string or a file. For these particular models that value is:
https://github.com/facebookresearch/fairseq/blob/26d62ae8fbf3deccf01a138d704be1e5c346ca9a/examples/nllb/modeling/scripts/flores200/langs.txt#L1
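For intuition, here is a minimal sketch of the augmentation step those links describe (this is not fairseq's actual code, and the helper name is hypothetical): the dictionary built from the SPM model is extended with one `__lang__` token per entry in the languages list, which is what enlarges the embedding table beyond the raw SPM vocabulary size.

```python
# Hypothetical sketch of the vocabulary augmentation described above:
# every language in langs.txt gets its own token appended *after* the
# SentencePiece subwords, enlarging the embedding table accordingly.
from fairseq.data import Dictionary

def augment_dictionary(dictionary: Dictionary, langs_path: str) -> Dictionary:
    with open(langs_path, encoding="utf-8") as f:
        langs = [line.strip() for line in f if line.strip()]
    for lang in langs:
        # Fairseq's multilingual language-token style wraps the code in
        # double underscores, e.g. "__eng_Latn__".
        dictionary.add_symbol(f"__{lang}__")
    return dictionary
```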
Please let me know if you have any questions about this or if I can be of any further help.
Thanks for opening an issue! We’ve managed to convert the models to the M2M_100 architecture and the tokenizers to a new NLLB tokenizer very closely resembling the mBART tokenizer.
We’re in the process of testing all models for generation and performance and I’ll likely open a PR in a few hours.
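For reference, given the conversion described above, usage of the converted checkpoints would look roughly like the sketch below. It assumes a transformers version that ships the NLLB tokenizer, the checkpoint name `facebook/nllb-200-distilled-600M` that the release uses, and FLORES-200 language codes such as `eng_Latn` and `fra_Latn`:

```python
# Sketch of post-conversion usage, assuming NLLB support in transformers
# (M2M_100-style seq2seq generation with a forced target-language token).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("No language left behind.", return_tensors="pt")
# Force the decoder to start with the target-language token (FLORES-200 code).
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```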