Using whitespace tokenizer for training models
Environment info
- `transformers` version: 4.6.1
- Platform: Linux-5.4.109+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyTorch version (GPU?): 1.8.1+cu101 (False)
- Tensorflow version (GPU?): 2.5.0 (False)
- Using GPU in script?: Yes/depends
- Using distributed or parallel set-up in script?: No
Who can help
- longformer, reformer, transfoxl, xlnet: @patrickvonplaten
- tokenizers: @LysandreJik
Information
Model I am using (Bert, XLNet …): BigBird
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
I have a dataset for which I wanted to use a tokenizer based on whitespace rather than any subword segmentation approach. The following snippet, which I found on GitHub, constructs and trains a custom tokenizer that operates on whitespace:
```python
from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit

# We build our custom tokenizer:
tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')

# We can train this tokenizer by giving it a list of paths to text files:
trainer = trainers.BpeTrainer(special_tokens=["[UNK]"], show_progress=True)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)
```
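For context, this is a minimal sanity check I run after training (the sample sentence is made up; words that are rare in the training data may still be split into BPE pieces within each whitespace-delimited chunk):

```python
# Quick check of the trained tokenizer: the pre-tokenizer splits on spaces,
# then the BPE model tokenizes within each whitespace-delimited piece.
encoding = tokenizer.encode("the quick brown fox")
print(encoding.tokens)
print(encoding.ids)
```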
I wanted to use it for pre-training the BigBird model, but I am facing two issues:
- I can't seem to use this snippet with the custom tokenizer above to convert tokenized sentences into model-friendly sequences:
```python
from tokenizers.processors import BertProcessing

tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)
```
This returns an error, and without any post-processing the encoded output does not contain the sequence start and end tokens (<s> and </s>) as expected.
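For completeness, this is the direction I have been experimenting with: since the snippet above builds a plain tokenizers.Tokenizer (not a wrapped fast tokenizer), I set the post-processor on it directly. This is only a sketch, and it assumes <s> and </s> are registered as special tokens at training time, which my original snippet does not do:

```python
from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit
from tokenizers.processors import BertProcessing

tokenizer = Tokenizer(BPE())
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')

# Register the sequence delimiters as special tokens so they get vocabulary ids.
trainer = trainers.BpeTrainer(
    special_tokens=["[UNK]", "<s>", "</s>"], show_progress=True
)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)

# A plain tokenizers.Tokenizer has no ._tokenizer attribute, so the
# post-processor is attached to the object itself.
tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)

print(tokenizer.encode("some text").tokens)  # should start with <s> and end with </s>
```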
- The next problem arises when I save the tokenizer state to a folder: I am unable to load it via

```python
tokenizer = BigBirdTokenizerFast.from_pretrained("./tok", max_len=16000)
```

since it yields an error saying that my directory does not 'reference' the tokenizer files. This shouldn't be an issue, since loading the same files with RobertaTokenizerFast does work, so I assume it has something to do with the tokenization post-processing phase.
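In case it is useful for triage, the workaround I have been considering (again only a sketch; it assumes a transformers version in which PreTrainedTokenizerFast accepts the tokenizer_object / tokenizer_file arguments) is to wrap the trained tokenizers.Tokenizer directly instead of loading it through BigBirdTokenizerFast:

```python
from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizers.Tokenizer trained above.
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    model_max_length=16000,
    unk_token="[UNK]",
    bos_token="<s>",
    eos_token="</s>",
)
wrapped_tokenizer.save_pretrained("./tok")  # writes tokenizer.json plus the config files

# Alternatively, save the raw tokenizer to a single JSON file and reload from it:
tokenizer.save("./tok/tokenizer.json")
reloaded = PreTrainedTokenizerFast(tokenizer_file="./tok/tokenizer.json")
```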
Fully Reproducible Colab
I am really confused about this, so I have created a fully reproducible Colab notebook with commented problems and synthetic data. Please find it here.
Thanks a ton in advance!!
Top GitHub Comments
Thanks a ton @LysandreJik for replying so quickly and efficiently 🍰 👍 🚀 !!!
For anyone else who might stumble on this problem, I have modified a simple example via the Colab link attached above. In case it is not working, I have uploaded the .ipynb file alongside this comment too. 🤗 Have a fantastic day!
HF_issue_repro.zip
Hey @LysandreJik, Thanks a ton for the tips, I will surely try them if I face this error again! 🤗
I am using the master branch now for my project, so I hope I won't face this problem again. However, I can't completely verify whether it works because I am unable to run it on TPU due to some memory leak. If related problems arise, I will surely try out either of your fixes 🚀
Have a fantastic day!