
[Question]: Register new Tokenizer

See original GitHub issue

Hi there,

I’m in the process of creating a new Transformer model. I have my own codebase and I’m using Transformers as an external library. If I implement a new tokenizer that inherits from an existing one (say the BERT one), is there any way to “register” my new tokenizer so that Hugging Face automatically instantiates it? I would like to support the AutoTokenizer API:

tokenizer = AutoTokenizer.from_pretrained("heriot-watt/my_model_name")

And I would like AutoTokenizer to look in my PYTHONPATH and automatically resolve the tokenizer class for the name my_model_name. I’ve seen that Transformers currently uses a hardcoded resolution strategy defined in configuration_auto.py and tokenization_auto.py. AllenNLP, for instance, uses a nice register decorator to automatically resolve models, dataset readers, and so on. What would be the best solution here?
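For illustration, the AllenNLP-style registration mentioned above can be sketched in plain Python. All names here (TOKENIZER_REGISTRY, register_tokenizer, auto_tokenizer) are illustrative stand-ins, not part of transformers or AllenNLP:

```python
# Minimal sketch of a name-based registry in the spirit of AllenNLP's
# register decorator. Nothing here is real transformers API.
TOKENIZER_REGISTRY = {}

def register_tokenizer(name):
    """Class decorator that records a tokenizer class under `name`."""
    def wrapper(cls):
        TOKENIZER_REGISTRY[name] = cls
        return cls
    return wrapper

@register_tokenizer("heriot-watt/my_model_name")
class MyModelTokenizer:
    def __init__(self, vocab=None):
        self.vocab = vocab or {}

def auto_tokenizer(name, **kwargs):
    """Resolve a registered tokenizer class by name and instantiate it."""
    try:
        cls = TOKENIZER_REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown tokenizer: {name!r}")
    return cls(**kwargs)

tok = auto_tokenizer("heriot-watt/my_model_name")
```

Because the decorator runs at import time, merely importing the package that defines MyModelTokenizer is enough to make the name resolvable, which is the PYTHONPATH-driven behavior the question asks for.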

Thanks for your answer, Alessandro

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

2 reactions
LysandreJik commented, Feb 25, 2021

Thanks for showing us how you do it! That’s a very interesting usage of the AutoModels, and definitely something we would be interested in adding, for example via a transformers.register_auto_model(xxx) or something along those lines.
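The register_auto_model helper proposed above did not exist at the time; a purely hypothetical sketch of what such a one-call registration might do, with local dicts standing in for transformers’ internal mappings:

```python
# Hypothetical sketch only: these dicts stand in for transformers'
# internal CONFIG_MAPPING / MODEL_MAPPING / TOKENIZER_MAPPING tables.
# register_auto_model is the proposed (not real) helper.
CONFIG_MAPPING = {}
MODEL_MAPPING = {}
TOKENIZER_MAPPING = {}

def register_auto_model(model_type, config_cls, model_cls, tokenizer_cls):
    """Wire one model family into all three Auto* lookup tables at once."""
    CONFIG_MAPPING[model_type] = config_cls
    MODEL_MAPPING[config_cls] = model_cls
    TOKENIZER_MAPPING[config_cls] = tokenizer_cls

class MyModelConfig: ...
class MyModel: ...
class MyModelTokenizer: ...

register_auto_model("my_model", MyModelConfig, MyModel, MyModelTokenizer)
```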

2 reactions
aleSuglia commented, Feb 18, 2021

So I take it you’re not planning to have automatic module discovery. I see. Anyway, I feel an equally nice way to solve this is to have a folder on your current path called heriot-watt/my_model_name, containing the config files and tokenizer files that belong to the tokenizer I’m inheriting from. Then, in my package’s __init__.py, I had to add the following:

# Note: these mappings are internal to transformers, so their import
# paths may change between versions (this was against transformers 4.x).
from transformers.models.auto.configuration_auto import CONFIG_MAPPING, MODEL_NAMES_MAPPING
from transformers.models.auto.modeling_auto import MODEL_MAPPING
from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING

MODEL_MAPPING.update({
    MyModelConfig: MyModel
})

CONFIG_MAPPING.update({
    "my_model": MyModelConfig
})

TOKENIZER_MAPPING.update({
    MyModelConfig: (MyModelTokenizer, MyModelTokenizerFast)
})

MODEL_NAMES_MAPPING.update({
    "my_model_name": "MyModel"
})

In this way, I’m able to use the Auto* API just fine 😃
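With those mappings patched, the Auto* lookup is essentially a two-step dict resolution. A toy version of that path, from a model_type string to a tokenizer instance, using local stand-ins for the internal mappings (note that later transformers releases added AutoConfig.register and AutoTokenizer.register classmethods that formalize exactly this pattern):

```python
# Toy resolution path mimicking the Auto* lookup; all names are local
# stand-ins for transformers internals, not real API.
CONFIG_MAPPING = {}
TOKENIZER_MAPPING = {}

class MyModelConfig:
    model_type = "my_model"

class MyModelTokenizer:
    pass

CONFIG_MAPPING["my_model"] = MyModelConfig
# Values mirror the (slow, fast) tokenizer pairs used in the comment above;
# no fast tokenizer in this toy example.
TOKENIZER_MAPPING[MyModelConfig] = (MyModelTokenizer, None)

def auto_tokenizer_for(model_type):
    """Step 1: model_type -> config class. Step 2: config class -> tokenizer."""
    config_cls = CONFIG_MAPPING[model_type]
    slow, fast = TOKENIZER_MAPPING[config_cls]
    return (fast or slow)()
```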

Read more comments on GitHub >

Top Results From Across the Web

Training a new tokenizer from an old one - Hugging Face
Training a tokenizer is a statistical process that tries to identify which subwords are the best to pick for a given corpus, and...
[Bug] setTokenProvider register new tokenizer, but ... - GitHub
Register a new language monaco.languages.register({ id: 'mySpecialLanguage' }); monaco.languages.setTokensProvider("mySpecialLanguage" ...
Registering a new FTS tokenizer in SQLite3 w/ Python
I'm building an application which requires a custom Tokenizer in its FTS database. I have found a Tokenizer which does what I want...
Tokenizer · spaCy API Documentation
Tokenize a string with a slow debugging tokenizer that provides information about which tokenizer rule or pattern was matched for each token. The...
Tokenization in NLP: Types, Challenges, Examples, Tools
It has an important effect on the rest of your pipeline. A tokenizer breaks unstructured data and natural language text into chunks of ......
