❓ Define tokenizer from `tokenizers` as a `PreTrainedTokenizer`
Hi there,

I defined a simple whitespace tokenizer using the `tokenizers` library and I would like to integrate it with the `transformers` ecosystem. As an example, I would like to be able to use it with `DataCollatorWithPadding`. Is there an easy (i.e., non-hacky) way to integrate tokenizers from the `tokenizers` library with the `PreTrainedTokenizer` class?

For reference, please find below the code for the whitespace tokenizer.

Thanks a lot in advance for your help.

Best,
Pietro
```python
import os

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import Digits, Punctuation, Sequence, WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer


class WordTokenizer:  # <- Maybe subclassing here?
    def __init__(self, max_vocab_size=30_000, unk_token="[UNK]", pad_token="[PAD]"):
        self.max_vocab_size = max_vocab_size
        self.unk_token = unk_token
        self.pad_token = pad_token
        self.tokenizer, self.trainer = self._build_tokenizer()
        os.environ["TOKENIZERS_PARALLELISM"] = "true"

    def _build_tokenizer(self):
        # Word-level model with BERT-style normalization and digit/punctuation/whitespace splitting
        tokenizer = Tokenizer(WordLevel(unk_token=self.unk_token))
        tokenizer.normalizer = BertNormalizer()
        tokenizer.pre_tokenizer = Sequence([Digits(), Punctuation(), WhitespaceSplit()])
        trainer = WordLevelTrainer(vocab_size=self.max_vocab_size, special_tokens=[self.pad_token, self.unk_token])
        return tokenizer, trainer

    def __call__(self, text_column, batch):
        return {"input_ids": [enc.ids for enc in self.tokenizer.encode_batch(batch[text_column])]}

    @staticmethod
    def _batch_iterator(hf_dataset, batch_size, text_column):
        for i in range(0, len(hf_dataset), batch_size):
            yield hf_dataset[i : i + batch_size][text_column]

    def fit(self, hf_dataset, batch_size=1_000, text_column="text"):
        # Train the word-level vocabulary by streaming batches from a datasets.Dataset
        self.tokenizer.train_from_iterator(
            self._batch_iterator(hf_dataset, batch_size, text_column), trainer=self.trainer, length=len(hf_dataset)
        )
        self.vocab_size = self.tokenizer.get_vocab_size()
```
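One way to get this kind of integration is to wrap the trained `tokenizers.Tokenizer` in `transformers.PreTrainedTokenizerFast` via its `tokenizer_object` argument. The sketch below is an illustration rather than the exact solution discussed in the thread; it assumes the `WordTokenizer` class above has been trained on a hypothetical `datasets` dataset `ds` with a `"text"` column, and that PyTorch is installed for the collator's default tensor output.

```python
from transformers import DataCollatorWithPadding, PreTrainedTokenizerFast

# Train the word-level tokenizer defined above
# (`ds` is a hypothetical datasets.Dataset with a "text" column).
word_tok = WordTokenizer()
word_tok.fit(ds, text_column="text")

# Wrap the raw tokenizers.Tokenizer so it behaves like a transformers tokenizer.
# The special tokens must be passed explicitly; the wrapper cannot infer them.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=word_tok.tokenizer,
    unk_token=word_tok.unk_token,
    pad_token=word_tok.pad_token,
)

# The wrapped tokenizer can now be used where a transformers tokenizer is expected,
# e.g. with DataCollatorWithPadding (returns PyTorch tensors by default).
collator = DataCollatorWithPadding(tokenizer=hf_tokenizer)
batch = collator([hf_tokenizer(t) for t in ["a short text", "a somewhat longer text"]])
print(batch["input_ids"].shape)  # padded to the longest sequence in the batch
```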
Thank you very much for your answer @pietrolesci. I'm glad to read your solution; it's always very interesting to see how you use the libraries and what difficulties you're facing!
Thanks for the feedback @pietrolesci! 🤗

It makes me think that maybe we should explain this point in the documentation shared by LysandreJik, because indeed `PreTrainedTokenizer` has no way to automatically know which tokens of the tokenizer correspond to the `unk_token`, `cls_token`, etc. But if you ever see an automatic way to do it, I'd be really happy to discuss it!
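As a concrete illustration of that point: even when the underlying `tokenizers.Tokenizer` is serialized and reloaded, the special tokens have to be restated by hand when building the `transformers` wrapper. A minimal sketch, assuming the hypothetical `word_tok` instance from above and an illustrative file name:

```python
from transformers import PreTrainedTokenizerFast

# Serialize the raw tokenizers.Tokenizer
# (`word_tok` is the hypothetical trained WordTokenizer instance from above).
word_tok.tokenizer.save("tokenizer.json")

# When reloading, transformers cannot infer which vocabulary entries play the
# role of unk/pad/cls/..., so they must be declared explicitly again.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
)
```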