question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Make all models folder independent from each other

See original GitHub issue

Transformers has a Do Repeat Yourself policy in the sense that it does not provide building blocks that we then mix and match, but we strive to have each model be self-contained in terms of code, at the price of code duplication. You can find more about this philosophy in this blog post.

There are instances in the library (mostly with older models) where this is not respected. This issue will serve as a tracker for all those instances, so that the library is cleaner and each model/tokenizer/config is easier to tweak by itself. This will also make it easier for us to test individual models in autonomy.

If you wish to make a contribution to Transformers, you can help! Pick a config/model/tokenizer in the list below (double-check someone is not working on it already by searching this page!) and indicate with a comment that wish to work on it. Read our contributing guide as well as the section below, and once you are ready, open a PR and tag @sgugger on it.

How to remove a dependency from another model

There are two different types of dependencies: either a configuration/model/tokenizer uses an intermediate object from another model (example: some tokenizer uses the BasicTokenizer defined in the tokenization_bert module, or it subclasses another configuration/model/tokenizer.

In the first case, the object code should just be copied inside the file, with a “Copied from” statement. This will make sure that code is always kept up to date even if the basic object is modified. For instance, if a tokenizer is using BasicTokenizer, go copy the code in tokenization_bert for that class, then paste it in the tokenizer module you are treating and add the following copied from comment:

# Copied from transformers.models.bert.tokenization_bert.BasicTokenizer
class BasicTokenizer(object):
...

In the second case, the code of the class (and all its building blocks) should be copied and renamed to be prefixed by the model: for instance if you are copying code from the modeling_bert module to build Roberta, you replace all BertLayer, BertOutput etc… by RobertaLayer, RobertaOutput… You should then add a copied from statement (when the copy is without any modification) like this one:

# Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->Roberta
class RobertaAttention(nn.Module):
...

Note the replacement pattern that will adapt all names used. Note that:

  • you can add more of those patterns, separated by a comma like here.
  • you can ask to replace all possible casings like here
  • you can just copy one method and not the whole class like here

NB: No need for copied from statements in the config (the defaults are probably different anyway).

Objects to cover

Configurations

  • Flaubert config (should not use XLM)
  • LayoutLM config (should not use Bert)
  • LongformerConfig (should not use Roberta)
  • MarkupLMConfig (should not Roberta)
  • RobertaConfig (should not use Bert)
  • XLM-ProphetNet config (should not use ProphetNet)
  • XLM-Roberta config (should not use Roberta)

Models

  • BertGeneration (should not use BertEncoder)
  • Camembert (should not use Roberta) (PyTorch + TF)
  • Flaubert (should not use XLM) (PyTorch + TF)
  • mT5 (should not use T5)
  • XLM-ProphetNet (should not use ProphetNet)
  • Xlm-Roberta (should not use Roberta)

Tokenizers

  • BertJapanese (should not use any imports from tokenization bert)
  • Blenderbot (should not use Roberta) (slow/fast)
  • Clip (should not use BasicTokenizer from Bert)
  • ConvBERT (should not use Bert) (slow/fast)
  • Cpm tokenizer (should not use XLNet) (slow/fast)
  • Derberta tokenizer (should not use GPT2) (slow/fast)
  • DistilBert (should not use Bert) (slow/fast)
  • Electra (should not use Bert) (fast)
  • Flaubert (should not use XLM)
  • Funnel (should not use Bert) (slow/fast)
  • Herbert (should not BasicTokenizer from Bert and XLM)
  • LayoutLM (should not use Bert) (slow/fast)
  • LED (should not use BART) (slow/fast)
  • Longformer (should not use Roberta) (fast tokenizer)
  • Luke (should not use Roberta)
  • Lxmert (should not use Bert) (slow/fast)
  • MobileBert (should not use Bert) (slow/fast)
  • Openai-GPT (should not use BasicTokenizer from Bert)
  • ProphetNet (should not use BasicTokenzier and WordPieceTokenizer from Bert)
  • Retribert tokenizer (should not use Bert) (slow/fast)
  • Roformer tokenizer (should not use any imports from tokenization bert)
  • Squeezebert tokenizer (should not use Bert) (slow/fast)

Issue Analytics

  • State:open
  • Created a year ago
  • Reactions:11
  • Comments:44 (38 by maintainers)

github_iconTop GitHub Comments

2reactions
Threepointone4commented, Oct 11, 2022

@sgugger I would like to contribute fast tokenizers ELECTRA and Longformer.

Edit : @sirmammingtonham I missed your message. I can take ELECTRA, you can take Longformer ?

2reactions
arnaudstieglercommented, Oct 3, 2022

Hello! Happy to take LayoutLM Config and Tokenizer 😃

Read more comments on GitHub >

github_iconTop Results From Across the Web

How do I separate my models out in django? - Stack Overflow
It is possible, just make sure to import all the models you create in __init__.py in your models directory. In your case, it...
Read more >
Can we have 2 folders under Model folder for 2 dbs - MSDN
Yeah you can have two folders in the Models folder for two different databases. Creating a folder using the Visual Stuido 'Create new...
Read more >
How to handle large number of models? - Laracasts
You can put models under another user defined folder, just namespace correctly. Taylor says. Copy Code Models typically live in the app directory,...
Read more >
The Most Efficient Way to Organize Dbt Models
The team at dbt recommends organizing your models into two different folders- staging and marts. Staging models are those that read from a...
Read more >
The Best Way to Organize IMPORTED MODELS in ... - YouTube
This also allows you to make copies of the entire folder and share them with colleagues (or save them on a shared drive)...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found