Make all models folder independent from each other

Transformers has a Do Repeat Yourself policy in the sense that it does not provide building blocks that we then mix and match, but we strive to have each model be self-contained in terms of code, at the price of code duplication. You can find more about this philosophy in this blog post.

There are instances in the library (mostly with older models) where this is not respected. This issue will serve as a tracker for all those instances, so that the library is cleaner and each model/tokenizer/config is easier to tweak by itself. This will also make it easier for us to test individual models in isolation.

If you wish to make a contribution to Transformers, you can help! Pick a config/model/tokenizer in the list below (double-check that no one is already working on it by searching this page!) and indicate with a comment that you wish to work on it. Read our contributing guide as well as the section below, and once you are ready, open a PR and tag @sgugger on it.

How to remove a dependency from another model

There are two different types of dependencies: either a configuration/model/tokenizer uses an intermediate object from another model (for example, a tokenizer that uses the BasicTokenizer defined in the tokenization_bert module), or it subclasses another configuration/model/tokenizer.
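
One way to spot both kinds of dependency in a module is to parse it and look at what it imports from other model folders. The sketch below is only an illustration: find_cross_model_dependencies is a hypothetical helper, not part of Transformers, and it assumes sibling models are imported with two-dot relative imports such as `from ..bert.tokenization_bert import BasicTokenizer`.

```python
import ast


def find_cross_model_dependencies(source: str) -> dict:
    """Report names imported from sibling model folders and how they are used.

    Hypothetical helper for illustration; it assumes sibling models are
    imported with two-dot relative imports (``from ..bert.x import Y``).
    """
    tree = ast.parse(source)

    # Map each imported name to the sibling module it comes from.
    imported = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.level == 2 and node.module:
            for alias in node.names:
                imported[alias.asname or alias.name] = node.module

    # Names used as a base class are the "subclassing" kind of dependency.
    subclassed = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            for base in node.bases:
                if isinstance(base, ast.Name) and base.id in imported:
                    subclassed.add(base.id)

    return {
        "subclassed": {name: imported[name] for name in sorted(subclassed)},
        "used_as_building_block": {
            name: module
            for name, module in sorted(imported.items())
            if name not in subclassed
        },
    }


# A made-up module that shows both kinds of dependency.
example = '''
from ..bert.tokenization_bert import BasicTokenizer
from ..roberta.configuration_roberta import RobertaConfig


class MyConfig(RobertaConfig):
    model_type = "my-model"


class MyTokenizer:
    def __init__(self):
        self.basic_tokenizer = BasicTokenizer()
'''

report = find_cross_model_dependencies(example)
print(report)
```

Names reported under "used_as_building_block" correspond to the first case below, and names under "subclassed" to the second.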

In the first case, the object code should just be copied inside the file, with a “Copied from” statement. This ensures the copy is kept up to date even if the original object is modified. For instance, if a tokenizer is using BasicTokenizer, copy the code of that class from tokenization_bert, paste it into the tokenizer module you are working on, and add the following “Copied from” comment:

# Copied from transformers.models.bert.tokenization_bert.BasicTokenizer
class BasicTokenizer(object):
    ...

In the second case, the code of the class (and all its building blocks) should be copied and renamed so that it is prefixed by the model name: for instance, if you are copying code from the modeling_bert module to build Roberta, you replace all BertLayer, BertOutput, etc. with RobertaLayer, RobertaOutput, etc. You should then add a “Copied from” statement (when the copy is made without any modification) like this one:

# Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->Roberta
class RobertaAttention(nn.Module):
    ...

Note the replacement pattern, which adapts every name used in the copied code. In addition:

you can add more of those patterns, separated by commas;
you can ask for all possible casings of a name to be replaced;
you can copy a single method rather than the whole class.
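
To make the replacement patterns concrete, the sketch below parses a “Copied from” comment and applies its patterns to a piece of source code. It is a simplified illustration, not the checker that Transformers actually runs: parse_copied_from and apply_patterns are hypothetical names, and the `all-casing` keyword, along with its behaviour of also replacing the lowercase and uppercase forms, is an assumption about the comment syntax.

```python
import re

# Matches "# Copied from <dotted.path>", optionally followed by "with <patterns>".
COPIED_FROM = re.compile(
    r"#\s*Copied from\s+(?P<path>\S+)(?:\s+with\s+(?P<patterns>.+))?$"
)


def parse_copied_from(comment: str):
    """Return (object_path, [(old, new, all_casing), ...]) for a comment line."""
    match = COPIED_FROM.search(comment.strip())
    if match is None:
        raise ValueError(f"not a 'Copied from' comment: {comment!r}")
    patterns = []
    for chunk in (match.group("patterns") or "").split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Assumed syntax: "Old->New", optionally followed by "all-casing".
        pair, _, option = chunk.partition(" ")
        old, new = pair.split("->")
        patterns.append((old, new, option.strip() == "all-casing"))
    return match.group("path"), patterns


def apply_patterns(code: str, patterns) -> str:
    """Rename identifiers in `code` according to the parsed patterns."""
    mapping = {}
    for old, new, all_casing in patterns:
        mapping[old] = new
        if all_casing:
            mapping[old.lower()] = new.lower()
            mapping[old.upper()] = new.upper()
    if not mapping:
        return code
    # A single pass avoids re-replacing text that a previous pattern produced
    # (for example, "Roberta" itself contains "bert").
    keys = sorted(mapping, key=len, reverse=True)
    regex = re.compile("|".join(re.escape(key) for key in keys))
    return regex.sub(lambda m: mapping[m.group(0)], code)


comment = "# Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->Roberta"
path, patterns = parse_copied_from(comment)

original = "class BertAttention(nn.Module):\n    ..."
expected_copy = apply_patterns(original, patterns)
print(path)
print(expected_copy)
```

A consistency check could then compare expected_copy with the code that follows the comment and flag any difference between the two.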

NB: No need for copied from statements in the config (the defaults are probably different anyway).

Objects to cover

Configurations

Flaubert config (should not use XLM)
LayoutLM config (should not use Bert)
LongformerConfig (should not use Roberta)
MarkupLMConfig (should not use Roberta)
RobertaConfig (should not use Bert)
XLM-ProphetNet config (should not use ProphetNet)
XLM-Roberta config (should not use Roberta)

Models

BertGeneration (should not use BertEncoder)
Camembert (should not use Roberta) (PyTorch + TF)
Flaubert (should not use XLM) (PyTorch + TF)
mT5 (should not use T5)
XLM-ProphetNet (should not use ProphetNet)
XLM-Roberta (should not use Roberta)

Tokenizers

BertJapanese (should not use any imports from tokenization bert)
Blenderbot (should not use Roberta) (slow/fast)
Clip (should not use BasicTokenizer from Bert)
ConvBERT (should not use Bert) (slow/fast)
Cpm tokenizer (should not use XLNet) (slow/fast)
Deberta tokenizer (should not use GPT2) (slow/fast)
DistilBert (should not use Bert) (slow/fast)
Electra (should not use Bert) (fast)
Flaubert (should not use XLM)
Funnel (should not use Bert) (slow/fast)
Herbert (should not use BasicTokenizer from Bert and XLM)
LayoutLM (should not use Bert) (slow/fast)
LED (should not use BART) (slow/fast)
Longformer (should not use Roberta) (fast tokenizer)
Luke (should not use Roberta)
Lxmert (should not use Bert) (slow/fast)
MobileBert (should not use Bert) (slow/fast)
Openai-GPT (should not use BasicTokenizer from Bert)
ProphetNet (should not use BasicTokenizer and WordPieceTokenizer from Bert)
Retribert tokenizer (should not use Bert) (slow/fast)
Roformer tokenizer (should not use any imports from tokenization bert)
Squeezebert tokenizer (should not use Bert) (slow/fast)

Issue Analytics

Created a year ago
Reactions: 11
Comments: 44 (38 by maintainers)

Top GitHub Comments

2 reactions

Threepointone4 commented, Oct 11, 2022

@sgugger I would like to contribute fast tokenizers ELECTRA and Longformer.

Edit : @sirmammingtonham I missed your message. I can take ELECTRA, you can take Longformer ?

2 reactions

arnaudstiegler commented, Oct 3, 2022

Hello! Happy to take LayoutLM Config and Tokenizer 😃
