DataCollatorForWholeWordMask only works for BERT, and the docstring says nothing about it.
Environment info
- transformers version:
- Platform:
- Python version:
- PyTorch version (GPU?):
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help
@patrickvonplaten @LysandreJik @patil-suraj @sgugger
Information
Model I am using (Bert, XLNet …): DeBERTa (v1) base
The problem arises when using:
- the official example scripts: (give details below): DataCollatorForWholeWordMask
- my own modified scripts: (give details below)
The tasks I am working on are:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
The DataCollatorForWholeWordMask, which should be usable for pre-training a RoBERTa or a DeBERTa model, for example (as you don’t have a SpanCollator), only works for BERT, and one needs to look at the details of the collator code to notice this. I have been training a language model from scratch for weeks now, only to notice yesterday that your collator for whole word masking is wrong and only works for BERT.
Steps to reproduce the behavior:
- Try to use the DataCollatorForWholeWordMask with any model that is not BERT (see the illustration below).
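For instance, the following sketch illustrates why the word grouping never triggers outside BERT (the checkpoints and the example word are only illustrative): the collator decides word boundaries by looking for the WordPiece "##" prefix, which byte-level BPE tokenizers never produce.

```python
from transformers import AutoTokenizer

# WordPiece (BERT): sub-word continuations carry the "##" prefix that
# DataCollatorForWholeWordMask uses to group tokens back into whole words.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.tokenize("antidisestablishmentarianism"))
# e.g. ['anti', '##dis', '##est', ...] -- continuations start with "##"

# Byte-level BPE (RoBERTa, and likewise DeBERTa v1): no token ever starts
# with "##", so the collator treats every token as its own word and the
# masking silently degenerates to ordinary token-level MLM.
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")
print(roberta_tok.tokenize("antidisestablishmentarianism"))
# e.g. ['ant', 'idis', 'establishment', ...] -- no "##" anywhere
```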
Expected behavior
A data collator that ships with the library should work for any model, not only for BERT. Or, at the very least, the docstring should make it clear that one will waste huge amounts of money by using this collator with models other than BERT.

That said, I would like to know how I could use the word ids from the tokenizer to do this, as in the token classification example you provide here: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb#scrollTo=MOsHUjgdIrIW In that example, extending labels to continuation tokens does not depend on the token starting with "##", but uses the word ids from the fast tokenizer. I think the DataCollatorForWholeWordMask should work generally, at least for all fast tokenizers, not only for BERT. In my case, I would like to know what I can do to train at least a little longer with the correct objective, i.e. whole word masking rather than plain MLM (a possible approach is sketched below).
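As a rough illustration (this is not an existing transformers API; the helper name and the 15% probability are arbitrary choices), the word grouping itself only needs word_ids() from a fast tokenizer:

```python
import random
import torch

def whole_word_mask(input_ids, word_ids, tokenizer, mlm_probability=0.15):
    """Hypothetical helper: mask whole words given word_ids() of one example.

    Assumes a fast (Rust-backed) tokenizer, since only a BatchEncoding from a
    fast tokenizer exposes word_ids().
    """
    labels = torch.full((len(input_ids),), -100, dtype=torch.long)

    # Group token positions by the word they belong to (None = special token).
    word_to_positions = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            word_to_positions.setdefault(wid, []).append(pos)

    # Sample whole words, then mask every sub-token of each sampled word.
    for positions in word_to_positions.values():
        if random.random() < mlm_probability:
            for pos in positions:
                labels[pos] = input_ids[pos]
                input_ids[pos] = tokenizer.mask_token_id

    return input_ids, labels
```

The 80/10/10 replacement scheme of the regular MLM collator is left out to keep the sketch short; the point is only that nothing here depends on the "##" convention.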
Top GitHub Comments
The problem is that after passing through datasets, the objects are dicts, not BatchEncoding, so they don’t have the word_ids() method, and without that we cannot generalize whole word masking. One solution is to pre-tokenize and pre-process the dataset inside the function you pass to the datasets map, but that disables dynamic masking, which is a key improvement of RoBERTa over BERT.
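One possible middle ground (a sketch only; the extra "word_ids" column and the checkpoint name are arbitrary choices) is to tokenize inside map but store the word ids as a plain column, so that a custom collator can still recover the word boundaries and keep the masking itself dynamic:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # any fast tokenizer

def tokenize_function(examples):
    # Tokenize here, while we still have a BatchEncoding and its word_ids().
    encoding = tokenizer(examples["text"], truncation=True, max_length=512)
    # Store the word ids as an ordinary column (None marks special tokens),
    # since datasets will later hand the collator plain dicts.
    encoding["word_ids"] = [
        encoding.word_ids(i) for i in range(len(encoding["input_ids"]))
    ]
    return encoding

# dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
```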
Thank you for elaborating!
Similarly to the implementation for BERT tokenizers in the current DataCollatorForWholeWordMask, it is possible to obtain a word-start mask for RoBERTa tokenizers by decoding every token in the collator, using something like this:
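A sketch of that idea, assuming a RoBERTa-style byte-level BPE tokenizer (the special-token handling is an added assumption rather than part of the original comment):

```python
def word_start_mask(tokenizer, input_ids):
    # Decode each token individually; with a byte-level BPE tokenizer, tokens
    # that begin a new word decode with a leading space.
    special_ids = set(tokenizer.all_special_ids)
    return [
        t not in special_ids and tokenizer.decode([t]).startswith(" ")
        for t in input_ids
    ]
```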
I believe this is accurate if the tokenizer is initialized with add_prefix_space=True; otherwise the first word is missed, which is probably acceptable in most circumstances. If this method is correct, it could be extended to BERT tokenizers as well, where the condition for the first token of a word is not tokenizer.decode([t]).startswith('##'). I’m not sure whether this is a path one wants to take here, though.