[Question]: Register new Tokenizer
Hi there,
I’m in the process of creating a new Transformer model. I have my own codebase and I’m using Transformers as an external library. If I implement a new tokenizer that inherits from an existing one (say the BERT one), is there any way to “register” my new tokenizer so that Hugging Face automatically instantiates it? I would like to support the `AutoTokenizer` API:

```python
tokenizer = AutoTokenizer.from_pretrained("heriot-watt/my_model_name")
```

And I would like `AutoTokenizer` to look in my `PYTHONPATH` and automatically resolve the tokenizer class associated with the name `my_model_name`. I’ve seen that currently Transformers uses a hardcoded resolution strategy defined in `configuration_auto.py` and `tokenization_auto.py`. For instance, AllenNLP uses a nice `register` annotation to automatically resolve models, dataset readers, and so on. What would be the best solution here?
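The AllenNLP-style registration mentioned above boils down to a name-to-class mapping plus a decorator. A minimal, self-contained sketch of that pattern (all names here are hypothetical, not AllenNLP's or Transformers' actual API):

```python
class Registry:
    """Maps string names to classes so they can be resolved by name."""

    def __init__(self):
        self._classes = {}

    def register(self, name):
        # Used as a decorator: stores the class under `name`, returns it unchanged.
        def decorator(cls):
            self._classes[name] = cls
            return cls
        return decorator

    def resolve(self, name):
        return self._classes[name]


tokenizer_registry = Registry()


@tokenizer_registry.register("my_model_name")
class MyTokenizer:
    pass


# Resolution by name, as an Auto* factory would do internally.
assert tokenizer_registry.resolve("my_model_name") is MyTokenizer
```

The key design point is that registration happens as a side effect of importing the module that defines the class, which is why a package's `__init__.py` is the natural place to trigger it.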
Thanks for your answer, Alessandro
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 1
- Comments: 8 (7 by maintainers)
Top GitHub Comments
Thanks for showing us how you do it! That’s a very interesting usage of the AutoModels, and definitely something we would be interested in adding, for example via a `transformers.register_auto_model(xxx)` or something along those lines.

So I take it you’re not planning to have automatic module discovery. I see. Anyway, I feel like an equally nice way to solve this is to have a folder on my current path called `heriot-watt/my_model_name`. In it, I have my config files and tokenizer files that belong to the `Tokenizer` I’m inheriting from. Then, in my package’s `__init__.py`, I had to add a bit of registration code. In this way, I’m able to use the `Auto*` API just fine 😃
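The author's original `__init__.py` snippet isn't reproduced above. Current Transformers releases do expose register hooks on the Auto classes, so a sketch of what such a file could contain might look like the following (the `MyModelConfig`/`MyModelTokenizer` names are hypothetical, inheriting from the BERT classes as the question suggests):

```python
from transformers import AutoConfig, AutoTokenizer, BertConfig, BertTokenizer


# Hypothetical custom classes inheriting from the BERT ones.
class MyModelConfig(BertConfig):
    model_type = "my_model_name"  # must match the key registered below


class MyModelTokenizer(BertTokenizer):
    pass


# Register the custom classes so the Auto* API can resolve them by model_type.
AutoConfig.register("my_model_name", MyModelConfig)
AutoTokenizer.register(MyModelConfig, slow_tokenizer_class=MyModelTokenizer)
```

With this registration imported, `AutoTokenizer.from_pretrained` on a checkpoint whose config declares `"model_type": "my_model_name"` resolves to the custom classes instead of the stock BERT ones.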