
[Question]: Register new Tokenizer

See original GitHub issue

Hi there,

I’m in the process of creating a new Transformer model. I have my own codebase and I’m using Transformers as an external library. If I implement a new tokenizer that inherits from an existing one (say the BERT one), is there any way to “register” my new tokenizer so that Hugging Face automatically instantiates it? I would like to support the AutoTokenizer API:

tokenizer = AutoTokenizer.from_pretrained("heriot-watt/my_model_name")

And I would like AutoTokenizer to look in my PYTHONPATH and automatically resolve the tokenizer class for the name my_model_name. I’ve seen that Transformers currently uses a hardcoded resolution strategy defined in configuration_auto.py and tokenization_auto.py. AllenNLP, for instance, uses a nice register decorator to automatically resolve models, dataset readers, and so on. What would be the best solution here?
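For illustration, the AllenNLP-style registration mentioned above can be sketched in plain Python. All names here (TOKENIZER_REGISTRY, register_tokenizer, auto_tokenizer) are illustrative stand-ins, not part of transformers or AllenNLP:

```python
# Minimal sketch of a name-based registry in the spirit of AllenNLP's
# register decorator. Nothing here is real transformers API.
TOKENIZER_REGISTRY = {}

def register_tokenizer(name):
    """Class decorator that records a tokenizer class under `name`."""
    def wrapper(cls):
        TOKENIZER_REGISTRY[name] = cls
        return cls
    return wrapper

@register_tokenizer("heriot-watt/my_model_name")
class MyModelTokenizer:
    def __init__(self, vocab=None):
        self.vocab = vocab or {}

def auto_tokenizer(name, **kwargs):
    """Resolve a registered tokenizer class by name and instantiate it."""
    try:
        cls = TOKENIZER_REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown tokenizer: {name!r}")
    return cls(**kwargs)

tok = auto_tokenizer("heriot-watt/my_model_name")
```

Because the decorator runs at import time, merely importing the package that defines MyModelTokenizer is enough to make the name resolvable, which is the PYTHONPATH-driven behavior the question asks for.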

Thanks for your answer, Alessandro

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

2 reactions
LysandreJik commented, Feb 25, 2021

Thanks for showing us how you do it! That’s a very interesting usage of the AutoModels, and definitely something we would be interested in adding, for example via a transformers.register_auto_model(xxx) or something along those lines.
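The register_auto_model helper proposed above did not exist at the time; a purely hypothetical sketch of what such a one-call registration might do, with local dicts standing in for transformers’ internal mappings:

```python
# Hypothetical sketch only: these dicts stand in for transformers'
# internal CONFIG_MAPPING / MODEL_MAPPING / TOKENIZER_MAPPING tables.
# register_auto_model is the proposed (not real) helper.
CONFIG_MAPPING = {}
MODEL_MAPPING = {}
TOKENIZER_MAPPING = {}

def register_auto_model(model_type, config_cls, model_cls, tokenizer_cls):
    """Wire one model family into all three Auto* lookup tables at once."""
    CONFIG_MAPPING[model_type] = config_cls
    MODEL_MAPPING[config_cls] = model_cls
    TOKENIZER_MAPPING[config_cls] = tokenizer_cls

class MyModelConfig: ...
class MyModel: ...
class MyModelTokenizer: ...

register_auto_model("my_model", MyModelConfig, MyModel, MyModelTokenizer)
```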

2 reactions
aleSuglia commented, Feb 18, 2021

So I take it you’re not planning to have automatic module discovery. I see. Anyway, I feel an equally nice way to solve this is to have a folder on your current path called heriot-watt/my_model_name, containing the config files and tokenizer files that belong to the tokenizer I’m inheriting from. Then, in my package’s __init__.py, I had to add the following:

# Note: these mappings are internal to transformers, so their import
# paths may change between versions (this was against transformers 4.x).
from transformers.models.auto.configuration_auto import CONFIG_MAPPING, MODEL_NAMES_MAPPING
from transformers.models.auto.modeling_auto import MODEL_MAPPING
from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING

MODEL_MAPPING.update({
    MyModelConfig: MyModel
})

CONFIG_MAPPING.update({
    "my_model": MyModelConfig
})

TOKENIZER_MAPPING.update({
    MyModelConfig: (MyModelTokenizer, MyModelTokenizerFast)
})

MODEL_NAMES_MAPPING.update({
    "my_model_name": "MyModel"
})

In this way, I’m able to use the Auto* API just fine 😃
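With those mappings patched, the Auto* lookup is essentially a two-step dict resolution. A toy version of that path, from a model_type string to a tokenizer instance, using local stand-ins for the internal mappings (note that later transformers releases added AutoConfig.register and AutoTokenizer.register classmethods that formalize exactly this pattern):

```python
# Toy resolution path mimicking the Auto* lookup; all names are local
# stand-ins for transformers internals, not real API.
CONFIG_MAPPING = {}
TOKENIZER_MAPPING = {}

class MyModelConfig:
    model_type = "my_model"

class MyModelTokenizer:
    pass

CONFIG_MAPPING["my_model"] = MyModelConfig
# Values mirror the (slow, fast) tokenizer pairs used in the comment above;
# no fast tokenizer in this toy example.
TOKENIZER_MAPPING[MyModelConfig] = (MyModelTokenizer, None)

def auto_tokenizer_for(model_type):
    """Step 1: model_type -> config class. Step 2: config class -> tokenizer."""
    config_cls = CONFIG_MAPPING[model_type]
    slow, fast = TOKENIZER_MAPPING[config_cls]
    return (fast or slow)()
```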

Read more comments on GitHub >

Top Results From Across the Web

Training a new tokenizer from an old one - Hugging Face
Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus, and...
[Bug] setTokenProvider register new tokenizer, but ... - GitHub
Register a new language monaco.languages.register({ id: 'mySpecialLanguage' }); monaco.languages.setTokensProvider("mySpecialLanguage" ...
Registering a new FTS tokenizer in SQLite3 w/ Python
I'm building an application which requires a custom Tokenizer in its FTS database. I have found a Tokenizer which does what I want...
Tokenizer · spaCy API Documentation
Tokenize a string with a slow debugging tokenizer that provides information about which tokenizer rule or pattern was matched for each token. The...
Tokenization in NLP: Types, Challenges, Examples, Tools
It has an important effect on the rest of your pipeline. A tokenizer breaks unstructured data and natural language text into chunks of ......
