Make all models folder independent from each other
See original GitHub issueTransformers has a Do Repeat Yourself policy in the sense that it does not provide building blocks that we then mix and match, but we strive to have each model be self-contained in terms of code, at the price of code duplication. You can find more about this philosophy in this blog post.
There are instances in the library (mostly with older models) where this is not respected. This issue will serve as a tracker for all those instances, so that the library is cleaner and each model/tokenizer/config is easier to tweak by itself. This will also make it easier for us to test individual models in autonomy.
If you wish to make a contribution to Transformers, you can help! Pick a config/model/tokenizer in the list below (double-check someone is not working on it already by searching this page!) and indicate with a comment that wish to work on it. Read our contributing guide as well as the section below, and once you are ready, open a PR and tag @sgugger on it.
How to remove a dependency from another model
There are two different types of dependencies: either a configuration/model/tokenizer uses an intermediate object from another model (example: some tokenizer uses the BasicTokenizer
defined in the tokenization_bert
module, or it subclasses another configuration/model/tokenizer.
In the first case, the object code should just be copied inside the file, with a “Copied from” statement. This will make sure that code is always kept up to date even if the basic object is modified. For instance, if a tokenizer is using BasicTokenizer
, go copy the code in tokenization_bert
for that class, then paste it in the tokenizer module you are treating and add the following copied from comment:
# Copied from transformers.models.bert.tokenization_bert.BasicTokenizer
class BasicTokenizer(object):
...
In the second case, the code of the class (and all its building blocks) should be copied and renamed to be prefixed by the model: for instance if you are copying code from the modeling_bert module to build Roberta, you replace all BertLayer
, BertOutput
etc… by RobertaLayer
, RobertaOutput
…
You should then add a copied from statement (when the copy is without any modification) like this one:
# Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->Roberta
class RobertaAttention(nn.Module):
...
Note the replacement pattern that will adapt all names used. Note that:
- you can add more of those patterns, separated by a comma like here.
- you can ask to replace all possible casings like here
- you can just copy one method and not the whole class like here
NB: No need for copied from statements in the config (the defaults are probably different anyway).
Objects to cover
Configurations
- Flaubert config (should not use XLM)
- LayoutLM config (should not use Bert)
- LongformerConfig (should not use Roberta)
- MarkupLMConfig (should not Roberta)
- RobertaConfig (should not use Bert)
- XLM-ProphetNet config (should not use ProphetNet)
- XLM-Roberta config (should not use Roberta)
Models
- BertGeneration (should not use BertEncoder)
- Camembert (should not use Roberta) (PyTorch + TF)
- Flaubert (should not use XLM) (PyTorch + TF)
- mT5 (should not use T5)
- XLM-ProphetNet (should not use ProphetNet)
- Xlm-Roberta (should not use Roberta)
Tokenizers
- BertJapanese (should not use any imports from tokenization bert)
- Blenderbot (should not use Roberta) (slow/fast)
- Clip (should not use BasicTokenizer from Bert)
- ConvBERT (should not use Bert) (slow/fast)
- Cpm tokenizer (should not use XLNet) (slow/fast)
- Derberta tokenizer (should not use GPT2) (slow/fast)
- DistilBert (should not use Bert) (slow/fast)
- Electra (should not use Bert) (fast)
- Flaubert (should not use XLM)
- Funnel (should not use Bert) (slow/fast)
- Herbert (should not BasicTokenizer from Bert and XLM)
- LayoutLM (should not use Bert) (slow/fast)
- LED (should not use BART) (slow/fast)
- Longformer (should not use Roberta) (fast tokenizer)
- Luke (should not use Roberta)
- Lxmert (should not use Bert) (slow/fast)
- MobileBert (should not use Bert) (slow/fast)
- Openai-GPT (should not use BasicTokenizer from Bert)
- ProphetNet (should not use BasicTokenzier and WordPieceTokenizer from Bert)
- Retribert tokenizer (should not use Bert) (slow/fast)
- Roformer tokenizer (should not use any imports from tokenization bert)
- Squeezebert tokenizer (should not use Bert) (slow/fast)
Issue Analytics
- State:
- Created a year ago
- Reactions:11
- Comments:44 (38 by maintainers)
Top GitHub Comments
@sgugger I would like to contribute fast tokenizers
ELECTRA
andLongformer
.Edit : @sirmammingtonham I missed your message. I can take
ELECTRA
, you can takeLongformer
?Hello! Happy to take LayoutLM Config and Tokenizer 😃