[example scripts] disambiguate language specification API
Currently in example scripts like run_seq2seq.py
we have:
- for t5
--task translation_en_to_ro
--source_prefix "translate English to Romanian: "
- Also these 2:
--target_lang ro_RO
--source_lang en_XX
are used only for MBart and are ignored for other models, which means that people will unknowingly try to use these two as well when they aren’t needed.
The problem in both situations is that we provide an error-prone API: a user who wants to change the language may forget that the languages are specified in more than one place and change only one of them, which leads to a broken outcome.
If such an error is made, the specification supplied by the user becomes ambiguous, because one can’t tell which of the multiple inputs takes precedence.
Proposal: There should be only one way to input a set of languages and not multiple ways.
Specifically:
- in case 1, probably the easiest is to leave
--task translation_en_to_ro
and auto-generate --source_prefix "translate English to Romanian: "
- in case 2, assert if
--target_lang
or --source_lang
are passed and the model is not MBart.
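As a rough sketch of what the two proposals could look like (not the actual example-script code): `build_source_prefix`, `check_lang_args`, `LANG_NAMES` and `TASK_RE` are hypothetical names invented here, the language table is a tiny hand-written stand-in, and the model type is assumed to be available as a plain string such as "mbart" or "t5". The sketch raises ValueError rather than using a bare assert, so the user gets a readable message.

```python
import re

# Hypothetical code-to-name table; a real script would need a fuller map.
LANG_NAMES = {"en": "English", "ro": "Romanian", "de": "German", "fr": "French"}

# Matches task names of the form translation_xx_to_yy.
TASK_RE = re.compile(r"^translation_([a-z]{2})_to_([a-z]{2})$")


def build_source_prefix(task):
    """Case 1: derive the T5 prefix from a task such as 'translation_en_to_ro'."""
    match = TASK_RE.match(task)
    if match is None:
        raise ValueError(f"unrecognized translation task: {task!r}")
    src, tgt = match.groups()
    for code in (src, tgt):
        if code not in LANG_NAMES:
            raise ValueError(f"no language name known for code {code!r}")
    return f"translate {LANG_NAMES[src]} to {LANG_NAMES[tgt]}: "


def check_lang_args(model_type, source_lang=None, target_lang=None):
    """Case 2: reject --source_lang/--target_lang for models that ignore them."""
    if model_type != "mbart" and (source_lang or target_lang):
        raise ValueError(
            "--source_lang/--target_lang are only used by MBart, "
            f"but the model type is {model_type!r}"
        )


if __name__ == "__main__":
    print(repr(build_source_prefix("translation_en_to_ro")))
    check_lang_args("mbart", source_lang="en_XX", target_lang="ro_RO")  # accepted
```

With this shape the user only ever states the language pair once per model family, so there is no second input that can drift out of sync.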
Thinking more about it, case 1 is a must to solve, because if a user omits --source_prefix
or makes a typo in it, the train/eval won’t fail, but will mysteriously produce a really bad outcome. This is not user-friendly.
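One way to turn that silent failure into a loud one, assuming the script can already derive the expected prefix from --task, is to compare it with whatever the user passed. `resolve_source_prefix` below is a hypothetical helper written for illustration, not an existing function in the example scripts.

```python
def resolve_source_prefix(derived_prefix, supplied_prefix=None):
    """Return the prefix to use, failing fast if the two inputs disagree.

    derived_prefix: the prefix computed from --task (assumed to exist already).
    supplied_prefix: the value of --source_prefix, or None if it was not passed.
    """
    if supplied_prefix is None:
        # Nothing was passed explicitly, so the derived value is the only input.
        return derived_prefix
    if supplied_prefix != derived_prefix:
        # A typo or a stale value: stop before training instead of producing
        # a model that quietly performs badly.
        raise ValueError(
            f"--source_prefix {supplied_prefix!r} does not match the prefix "
            f"{derived_prefix!r} implied by --task"
        )
    return supplied_prefix


if __name__ == "__main__":
    expected = "translate English to Romanian: "
    print(repr(resolve_source_prefix(expected)))
```

An exact string comparison is deliberately strict here: even a missing trailing space changes the model input, so it is treated as a mismatch.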
Issue Analytics
- Created 3 years ago
- Comments:12 (12 by maintainers)
Top GitHub Comments
we require running
pip install -r examples/seq2seq/requirements.txt
already, so why not follow suit.
This is for the pre-trained models, but if a user provides their own model it could be any language.
Plus you have https://github.com/google-research/multilingual-t5.
I wonder if there is a python module that comes with such a map.