
❓ Define tokenizer from `tokenizers` as a `PreTrainedTokenizer`

See original GitHub issue

Hi there,

I defined a simple whitespace tokenizer using the tokenizers library and I would like to integrate it with the transformers ecosystem. As an example, I would like to be able to use it with DataCollatorWithPadding. Is there an easy (i.e., non-hacky) way to integrate a tokenizer from the tokenizers library with the PreTrainedTokenizer class?

For reference, please find below the code for the whitespace tokenizer.

Thanks a lot in advance for your help.

Best, Pietro

import os

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import Digits, Punctuation, Sequence, WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer


class WordTokenizer:  # <- Maybe subclassing here?
    def __init__(self, max_vocab_size=30_000, unk_token="[UNK]", pad_token="[PAD]"):
        self.max_vocab_size = max_vocab_size
        self.unk_token = unk_token
        self.pad_token = pad_token
        self.tokenizer, self.trainer = self._build_tokenizer()
        os.environ["TOKENIZERS_PARALLELISM"] = "true"

    def _build_tokenizer(self):
        # Word-level model with BERT-style normalization and digit/punctuation/whitespace splitting
        tokenizer = Tokenizer(WordLevel(unk_token=self.unk_token))
        tokenizer.normalizer = BertNormalizer()
        tokenizer.pre_tokenizer = Sequence([Digits(), Punctuation(), WhitespaceSplit()])
        trainer = WordLevelTrainer(vocab_size=self.max_vocab_size, special_tokens=[self.pad_token, self.unk_token])
        return tokenizer, trainer

    def __call__(self, text_column, batch):
        # Encode a batch of examples and keep only the token ids
        return {"input_ids": [enc.ids for enc in self.tokenizer.encode_batch(batch[text_column])]}

    @staticmethod
    def _batch_iterator(hf_dataset, batch_size, text_column):
        for i in range(0, len(hf_dataset), batch_size):
            yield hf_dataset[i : i + batch_size][text_column]

    def fit(self, hf_dataset, batch_size=1_000, text_column="text"):
        # Train the word-level vocabulary from batches of the dataset's text column
        self.tokenizer.train_from_iterator(
            self._batch_iterator(hf_dataset, batch_size, text_column), trainer=self.trainer, length=len(hf_dataset)
        )
        self.vocab_size = self.tokenizer.get_vocab_size()
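
A minimal sketch of the integration asked about above, assuming a datasets.Dataset named hf_dataset with a "text" column (the names word_tokenizer, hf_tokenizer, and collator are illustrative): the trained tokenizers.Tokenizer can be wrapped in PreTrainedTokenizerFast via its tokenizer_object argument, with the special tokens declared explicitly, and the wrapper can then be handed to DataCollatorWithPadding.

from transformers import DataCollatorWithPadding, PreTrainedTokenizerFast

# Train the whitespace tokenizer defined above (hf_dataset is assumed to exist)
word_tokenizer = WordTokenizer()
word_tokenizer.fit(hf_dataset, text_column="text")

# Wrap the raw tokenizers.Tokenizer; the wrapper cannot infer the special tokens,
# so they are declared explicitly and resolved against the trained vocabulary
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=word_tokenizer.tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
)

# The wrapper now behaves like any fast PreTrainedTokenizer, so dynamic padding works
collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)
features = [hf_tokenizer(text) for text in ["a short example", "a somewhat longer second example"]]
batch = collator(features)  # pads every field to the longest sequence in the batch

Note that the collator only pads; truncation, if needed, is still configured when calling the tokenizer.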

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
SaulLu commented, Nov 24, 2021

Thank you very much for your answer, @pietrolesci. I'm glad to read your solution; it's always very interesting to see how you use the libraries and what difficulties you're facing!

1 reaction
SaulLu commented, Nov 24, 2021

Thanks for the feedback, @pietrolesci! 🤗

It makes me think that maybe we should explain this point in the documentation shared by LysandreJik, because indeed PreTrainedTokenizer has no way to automatically know which tokens of the tokenizer correspond to unk_token, cls_token, etc.

But if you ever see an automatic way to do it, I’d be really happy to discuss it!
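
As a toy illustration of that point (the names and vocabulary below are made up for the example, not taken from the thread): the wrapper cannot guess which vocabulary entries are special, but once they are named at construction time they resolve to the ids already present in the trained vocab.

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from transformers import PreTrainedTokenizerFast

# Toy word-level tokenizer whose vocab already contains [PAD] and [UNK]
toy = Tokenizer(WordLevel(vocab={"[PAD]": 0, "[UNK]": 1, "hello": 2}, unk_token="[UNK]"))

bare = PreTrainedTokenizerFast(tokenizer_object=toy)
print(bare.pad_token)  # None -> DataCollatorWithPadding would refuse to pad with this wrapper

wrapped = PreTrainedTokenizerFast(tokenizer_object=toy, unk_token="[UNK]", pad_token="[PAD]")
print(wrapped.pad_token_id)  # 0 -> resolved against the id already in the vocab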


Top Results From Across the Web

  • Tokenizer - Hugging Face: a tokenizer is in charge of preparing the inputs for a model; the library contains tokenizers for all the models.
  • Get your own tokenizer with 🤗 Transformers & 🤗 Tokenizers: Lucile teaches how to build and train a custom tokenizer and how to use it in Transformers.
  • Create a Tokenizer and Train a Huggingface RoBERTa Model ...
  • Difference between the Tokenizer and the ... - hungsblog: the Tokenizer is a pipeline and defines the actual tokenization.
  • What is tokenizer.max_len doing in this class definition?
