
❓ Define tokenizer from `tokenizers` as a `PreTrainedTokenizer`

See original GitHub issue

Hi there,

I defined a simple whitespace tokenizer using the tokenizers library and I would like to integrate it with the transformers ecosystem. As an example, I would like to be able to use it with DataCollatorWithPadding. Is there an easy (i.e., non-hacky) way to integrate a tokenizer from the tokenizers library with the PreTrainedTokenizer class?

For reference, please find below the code for the whitespace tokenizer.

Thanks a lot in advance for your help.

Best, Pietro

import os

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import Digits, Punctuation, Sequence, WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer


class WordTokenizer:  # <- Maybe subclassing here?
    def __init__(self, max_vocab_size=30_000, unk_token="[UNK]", pad_token="[PAD]"):
        self.max_vocab_size = max_vocab_size
        self.unk_token = unk_token
        self.pad_token = pad_token
        self.tokenizer, self.trainer = self._build_tokenizer()
        os.environ["TOKENIZERS_PARALLELISM"] = "true"

    def _build_tokenizer(self):
        # Word-level model with BERT-style normalization and digit/punctuation/whitespace splitting
        tokenizer = Tokenizer(WordLevel(unk_token=self.unk_token))
        tokenizer.normalizer = BertNormalizer()
        tokenizer.pre_tokenizer = Sequence([Digits(), Punctuation(), WhitespaceSplit()])
        trainer = WordLevelTrainer(vocab_size=self.max_vocab_size, special_tokens=[self.pad_token, self.unk_token])
        return tokenizer, trainer

    def __call__(self, text_column, batch):
        # Encode a batch of examples and keep only the token ids
        return {"input_ids": [enc.ids for enc in self.tokenizer.encode_batch(batch[text_column])]}

    @staticmethod
    def _batch_iterator(hf_dataset, batch_size, text_column):
        for i in range(0, len(hf_dataset), batch_size):
            yield hf_dataset[i : i + batch_size][text_column]

    def fit(self, hf_dataset, batch_size=1_000, text_column="text"):
        # Train the word-level vocabulary from batches of the dataset's text column
        self.tokenizer.train_from_iterator(
            self._batch_iterator(hf_dataset, batch_size, text_column), trainer=self.trainer, length=len(hf_dataset)
        )
        self.vocab_size = self.tokenizer.get_vocab_size()
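
A minimal sketch of the integration asked about above, assuming a datasets.Dataset named hf_dataset with a "text" column (the names word_tokenizer, hf_tokenizer, and collator are illustrative): the trained tokenizers.Tokenizer can be wrapped in PreTrainedTokenizerFast via its tokenizer_object argument, with the special tokens declared explicitly, and the wrapper can then be handed to DataCollatorWithPadding.

from transformers import DataCollatorWithPadding, PreTrainedTokenizerFast

# Train the whitespace tokenizer defined above (hf_dataset is assumed to exist)
word_tokenizer = WordTokenizer()
word_tokenizer.fit(hf_dataset, text_column="text")

# Wrap the raw tokenizers.Tokenizer; the wrapper cannot infer the special tokens,
# so they are declared explicitly and resolved against the trained vocabulary
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=word_tokenizer.tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
)

# The wrapper now behaves like any fast PreTrainedTokenizer, so dynamic padding works
collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)
features = [hf_tokenizer(text) for text in ["a short example", "a somewhat longer second example"]]
batch = collator(features)  # pads every field to the longest sequence in the batch

Note that the collator only pads; truncation, if needed, is still configured when calling the tokenizer.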

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
SaulLu commented, Nov 24, 2021

Thank you very much for your answer, @pietrolesci. I'm glad to read your solution; it's always very interesting to see how you use the libraries and what difficulties you're facing!

1 reaction
SaulLu commented, Nov 24, 2021

Thanks for the feedback, @pietrolesci! 🤗

It makes me think that maybe we should explain this point in the documentation shared by LysandreJik, because indeed PreTrainedTokenizer has no way to automatically know which tokens of the tokenizer correspond to unk_token, cls_token, etc.

But if you ever see an automatic way to do it, I’d be really happy to discuss it!
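
As a toy illustration of that point (the names and vocabulary below are made up for the example, not taken from the thread): the wrapper cannot guess which vocabulary entries are special, but once they are named at construction time they resolve to the ids already present in the trained vocab.

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from transformers import PreTrainedTokenizerFast

# Toy word-level tokenizer whose vocab already contains [PAD] and [UNK]
toy = Tokenizer(WordLevel(vocab={"[PAD]": 0, "[UNK]": 1, "hello": 2}, unk_token="[UNK]"))

bare = PreTrainedTokenizerFast(tokenizer_object=toy)
print(bare.pad_token)  # None -> DataCollatorWithPadding would refuse to pad with this wrapper

wrapped = PreTrainedTokenizerFast(tokenizer_object=toy, unk_token="[UNK]", pad_token="[PAD]")
print(wrapped.pad_token_id)  # 0 -> resolved against the id already in the vocab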


Top Results From Across the Web

  • Tokenizer - Hugging Face: a tokenizer is in charge of preparing the inputs for a model; the library contains tokenizers for all the models.
  • Get your own tokenizer with 🤗 Transformers & 🤗 Tokenizers: Lucile teaches how to build and train a custom tokenizer and how to use it in Transformers.
  • Create a Tokenizer and Train a Huggingface RoBERTa Model ...
  • Difference between the Tokenizer and the ... - hungsblog: the Tokenizer is a pipeline and defines the actual tokenization.
  • What is tokenizer.max_len doing in this class definition?
