Add Support for "No Language Left Behind" (NLLB)
Model description
Hi,
Meta recently released another cool project called “No Language Left Behind” (NLLB):
No Language Left Behind (NLLB) is a first-of-its-kind, AI breakthrough project that open-sources models capable of delivering high-quality translations directly between any pair of 200+ languages — including low-resource languages like Asturian, Luganda, Urdu and more. It aims to help people communicate with anyone, anywhere, regardless of their language preferences.
The project itself is integrated into the fairseq library and available on the nllb branch: https://github.com/facebookresearch/fairseq/tree/nllb
The release includes both the code and the trained checkpoints.
A detailed 190-page paper is also available.
We should add support for this amazing project by integrating NLLB.
Open source status
- [x] The model implementation is available
- [x] The model weights are available
Provide useful links for the implementation
Model checkpoints are available here:
| Model Name | Model Type | #params | Checkpoint | Metrics |
|---|---|---|---|---|
| NLLB-200 | MoE | 54.5B | model | metrics |
| NLLB-200 | Dense | 3.3B | model | metrics |
| NLLB-200 | Dense | 1.3B | model | metrics |
| NLLB-200-Distilled | Dense | 1.3B | model | metrics |
| NLLB-200-Distilled | Dense | 600M | model | metrics |
Maintainers are: @vedanuj, @shruti-bh, @annasun28, @elbayadm, @jeanm, @jhcross, @kauterry and @huihuifan.
The implementation is available in the fairseq repo: https://github.com/facebookresearch/fairseq/tree/nllb
Top GitHub Comments
Hi, I’m one of the Meta engineers who worked on NLLB, and I’m happy to support this from our side. That’s indeed the correct (real) SPM model for the vocabulary used for input/output, but internally the model’s vocabulary (and embedding table) is extended at the end with one token per language, which happens here:
https://github.com/facebookresearch/fairseq/blob/26d62ae8fbf3deccf01a138d704be1e5c346ca9a/fairseq/data/multilingual/multilingual_utils.py#L64
This list of languages comes from an input argument, which reads them from a string or a file. For these particular models that value is:
https://github.com/facebookresearch/fairseq/blob/26d62ae8fbf3deccf01a138d704be1e5c346ca9a/examples/nllb/modeling/scripts/flores200/langs.txt#L1
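For intuition, here is a minimal sketch of the augmentation step those links describe (this is not fairseq's actual code, and the helper name is hypothetical): the dictionary built from the SPM model is extended with one `__lang__` token per entry in the languages list, which is what enlarges the embedding table beyond the raw SPM vocabulary size.

```python
# Hypothetical sketch of the vocabulary augmentation described above:
# every language in langs.txt gets its own token appended *after* the
# SentencePiece subwords, enlarging the embedding table accordingly.
from fairseq.data import Dictionary

def augment_dictionary(dictionary: Dictionary, langs_path: str) -> Dictionary:
    with open(langs_path, encoding="utf-8") as f:
        langs = [line.strip() for line in f if line.strip()]
    for lang in langs:
        # Fairseq's multilingual language-token style wraps the code in
        # double underscores, e.g. "__eng_Latn__".
        dictionary.add_symbol(f"__{lang}__")
    return dictionary
```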
Please let me know if you have any questions about this or if I can be of any further help.
Thanks for opening an issue! We’ve managed to convert the models to the M2M_100 architecture and the tokenizers to a new NLLB tokenizer very closely resembling the mBART tokenizer.
We’re in the process of testing all models for generation and performance and I’ll likely open a PR in a few hours.
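For reference, given the conversion described above, usage of the converted checkpoints would look roughly like the sketch below. It assumes a transformers version that ships the NLLB tokenizer, the checkpoint name `facebook/nllb-200-distilled-600M` that the release uses, and FLORES-200 language codes such as `eng_Latn` and `fra_Latn`:

```python
# Sketch of post-conversion usage, assuming NLLB support in transformers
# (M2M_100-style seq2seq generation with a forced target-language token).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("No language left behind.", return_tensors="pt")
# Force the decoder to start with the target-language token (FLORES-200 code).
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```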