
Using whitespace tokenizer for training models

See original GitHub issue

Environment info

  • transformers version: 4.6.1
  • Platform: Linux-5.4.109+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.8.1+cu101 (False)
  • Tensorflow version (GPU?): 2.5.0 (False)
  • Using GPU in script?: Yes/depends
  • Using distributed or parallel set-up in script?: No

Who can help

Information

Model I am using (Bert, XLNet …): BigBird

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

I have a dataset for which I wanted to use a tokenizer based on whitespace rather than any subword segmentation approach.

This snippet, which I found on GitHub, shows how to construct and train a custom tokenizer that splits on whitespace:

from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit

# We build our custom tokenizer:
tokenizer = Tokenizer(BPE()) 
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(' ')

# We can train this tokenizer by giving it a list of paths to text files:
trainer = trainers.BpeTrainer(special_tokens=["[UNK]"], show_progress=True)
tokenizer.train(files=['/content/dataset.txt'], trainer=trainer)
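
As a quick sanity check (the exact output depends on which merges the trainer learned from dataset.txt), encoding a short string should yield lower-cased, whitespace-split pieces:

enc = tokenizer.encode("Hello world")
print(enc.tokens)  # e.g. ['hello', 'world'] once both words are in the learned vocabulary
print(enc.ids)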

I wanted to use it for pre-training the BigBird model, but I am facing two issues:

  1. I can’t seem to use this snippet with the custom tokenizer above to convert tokenized sentences into model-friendly sequences:
from tokenizers.processors import BertProcessing

tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=16000)

This returns an error, and without any post-processing the output does not contain the expected sequence start and end tokens (<s>, </s>).
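
For reference, this is the post-processing setup I would expect to work on a raw tokenizers.Tokenizer: a minimal sketch, assuming <s> and </s> are added as special tokens at training time (token_to_id returns None for tokens missing from the vocabulary), using TemplateProcessing in place of BertProcessing, and setting post_processor directly rather than through ._tokenizer:

from tokenizers import Tokenizer, trainers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import CharDelimiterSplit
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = CharDelimiterSplit(" ")

# Register the sequence delimiters as special tokens so they receive vocabulary ids.
trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "<s>", "</s>"], show_progress=True)
tokenizer.train(files=["/content/dataset.txt"], trainer=trainer)

# A raw tokenizers.Tokenizer exposes post_processor directly (no ._tokenizer wrapper).
tokenizer.post_processor = TemplateProcessing(
    single="<s> $A </s>",
    special_tokens=[
        ("<s>", tokenizer.token_to_id("<s>")),
        ("</s>", tokenizer.token_to_id("</s>")),
    ],
)
tokenizer.enable_truncation(max_length=16000)

print(tokenizer.encode("some text").tokens)  # should now start with <s> and end with </s>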

  2. The next problem arises when I save the tokenizer state to a folder: I am unable to load it via
from transformers import BigBirdTokenizerFast
tokenizer = BigBirdTokenizerFast.from_pretrained("./tok", max_len=16000)

since it yields an error saying that my directory does not ‘reference’ the tokenizer files. This shouldn’t be an issue, because loading the same folder with RobertaTokenizerFast does work; I assume it has something to do with the tokenization post-processing phase.
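
A workaround sketch, not a confirmed fix: save the full tokenizer state to a single tokenizer.json and wrap it with the generic PreTrainedTokenizerFast class instead of BigBirdTokenizerFast, which (as far as I can tell) expects the vocabulary files of a pretrained BigBird checkpoint. The path and special-token names below are the placeholders from the snippets above:

import os
from transformers import PreTrainedTokenizerFast

# Persist the whole pipeline (model, normalizer, pre-tokenizer, post-processor) to one JSON file.
os.makedirs("./tok", exist_ok=True)
tokenizer.save("./tok/tokenizer.json")

# Load it back through the generic fast-tokenizer wrapper.
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="./tok/tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="[UNK]",
    model_max_length=16000,
)
print(wrapped_tokenizer("some text").input_ids)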

Fully Reproducible Colab

I am really confused about this, so I have created a fully reproducible Colab notebook with commented problems and synthetic data. Please find it here.

Thanx a ton in advance!!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
neel04 commented, Jun 9, 2021

Thanks a ton @LysandreJik for replying so quickly and efficiently 🍰 👍 🚀 !!!

For anyone else who might stumble on this problem, I have modified a simple example via the Colab link attached above. In case it is no longer working, I have also uploaded the .ipynb file alongside this comment. 🤗

Have a fantastic day!

HF_issue_repro.zip

0 reactions
neel04 commented, Jun 28, 2021

Hey @LysandreJik, Thanks a ton for the tips, I will surely try them if I face this error again! 🤗

I am using the master branch now for my project, so I hope I won’t face this problem again. However, I can’t completely verify whether it works because I am unable to run it on TPU due to some memory leak.

If related problems arise, I would surely try out either of your fixes 🚀

Have a fantastic day!
