Using whitespace tokenizer for training models
Environment info
- `transformers` version: 4.6.1
- Platform: Linux-5.4.109+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyTorch version (GPU?): 1.8.1+cu101 (False)
- Tensorflow version (GPU?): 2.5.0 (False)
- Using GPU in script?: Yes/depends
- Using distributed or parallel set-up in script?: No
Who can help
- longformer, reformer, transfoxl, xlnet: @patrickvonplaten
- tokenizers: @LysandreJik
Information
Model I am using (Bert, XLNet …): BigBird
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
I have a dataset for which I wanted to use a tokenizer based on whitespace rather than any subword segmentation approach. The following snippet, which I found on GitHub, constructs and trains a custom tokenizer that operates on whitespace:
```python
from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit

# We build our custom tokenizer:
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')

# We can train this tokenizer by giving it a list of paths to text files:
trainer = trainers.BpeTrainer(special_tokens=["[UNK]"], show_progress=True)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)
```
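For context, this is a minimal sanity check I run after training (the sample sentence is made up; words that are rare in the training data may still be split into BPE pieces within each whitespace-delimited chunk):

```python
# Quick check of the trained tokenizer: the pre-tokenizer splits on spaces,
# then the BPE model tokenizes within each whitespace-delimited piece.
encoding = tokenizer.encode("the quick brown fox")
print(encoding.tokens)
print(encoding.ids)
```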
I wanted to use it for pre-training the BigBird model, but I am facing two issues:
- I can't seem to use this snippet with the custom tokenizer above to convert tokenized sentences into model-friendly sequences:
```python
from tokenizers.processors import BertProcessing

tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)
```
This returns an error, and without any post-processing the encoded output does not contain the sequence start and end tokens (<s> and </s>) as expected.
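For completeness, this is the direction I have been experimenting with: since the snippet above builds a plain tokenizers.Tokenizer (not a wrapped fast tokenizer), I set the post-processor on it directly. This is only a sketch, and it assumes <s> and </s> are registered as special tokens at training time, which my original snippet does not do:

```python
from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit
from tokenizers.processors import BertProcessing

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')

# Register the sequence delimiters as special tokens so they get vocabulary ids.
trainer = trainers.BpeTrainer(
    special_tokens=["[UNK]", "<s>", "</s>"], show_progress=True
)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)

# A plain tokenizers.Tokenizer has no ._tokenizer attribute, so the
# post-processor is attached to the object itself.
tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)

print(tokenizer.encode("some text").tokens)  # should start with <s> and end with </s>
```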
- The next problem arises when I save the tokenizer state to a folder: I am unable to load it via

```python
tokenizer = BigBirdTokenizerFast.from_pretrained("./tok", max_len=16000)
```

since it yields an error saying that my directory does not 'reference' the tokenizer files. This shouldn't be an issue, since loading the same files with RobertaTokenizerFast does work, so I assume it has something to do with the tokenization post-processing phase.
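In case it is useful for triage, the workaround I have been considering (again only a sketch; it assumes a transformers version in which PreTrainedTokenizerFast accepts the tokenizer_object / tokenizer_file arguments) is to wrap the trained tokenizers.Tokenizer directly instead of loading it through BigBirdTokenizerFast:

```python
from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizers.Tokenizer trained above.
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    model_max_length=16000,
    unk_token="[UNK]",
    bos_token="<s>",
    eos_token="</s>",
)
wrapped_tokenizer.save_pretrained("./tok")  # writes tokenizer.json plus the config files

# Alternatively, save the raw tokenizer to a single JSON file and reload from it:
tokenizer.save("./tok/tokenizer.json")
reloaded = PreTrainedTokenizerFast(tokenizer_file="./tok/tokenizer.json")
```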
Fully Reproducible Colab
I am really confused about this, so I have created a fully reproducible Colab notebook with commented problems and synthetic data. Please find it here.
Thanks a ton in advance!!
Top GitHub Comments
Thanks a ton @LysandreJik for replying so quickly and efficiently 🍰 👍 🚀 !!!
For anyone else who might stumble on this problem, I have modified a simple example via the Colab link attached above. In case it is not working, I have uploaded the .ipynb file alongside this comment too. 🤗 Have a fantastic day!
HF_issue_repro.zip
Hey @LysandreJik, Thanks a ton for the tips, I will surely try them if I face this error again! 🤗
I am using the master branch now for my project, so I hope I won't face this problem again. However, I can't completely verify whether it works because I am unable to run it on TPU due to some memory leak. If related problems arise, I will surely try out either of your fixes 🚀
Have a fantastic day!