DataCollatorForWholeWordMask only works for BERT, and the docstring says nothing about it.
Environment info
- transformers version:
- Platform:
- Python version:
- PyTorch version (GPU?):
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help
@patrickvonplaten @LysandreJik @patil-suraj @sgugger
Information
Model I am using (Bert, XLNet …): DeBERTa (v1) base
The problem arises when using:
- the official example scripts: (give details below): DataCollatorForWholeWordMask
- my own modified scripts: (give details below)
The tasks I am working on are:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
The DataCollatorForWholeWordMask, which should be usable for pre-training a RoBERTa or a DeBERTa model, for example (as you don’t have a SpanCollator), only works for BERT, and one needs to look at the details of the collator code to notice this. I have been training a language model from scratch for weeks now, only to notice yesterday that your collator for whole word masking is wrong and only works for BERT.
Steps to reproduce the behavior:
- Try to use the DataCollatorForWholeWordMask with any model that is not BERT (see the illustration below).
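For instance, the following sketch illustrates why the word grouping never triggers outside BERT (the checkpoints and the example word are only illustrative): the collator decides word boundaries by looking for the WordPiece "##" prefix, which byte-level BPE tokenizers never produce.

```python
from transformers import AutoTokenizer

# WordPiece (BERT): sub-word continuations carry the "##" prefix that
# DataCollatorForWholeWordMask uses to group tokens back into whole words.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tok.tokenize("antidisestablishmentarianism"))
# e.g. ['anti', '##dis', '##est', ...] -- continuations start with "##"

# Byte-level BPE (RoBERTa, and likewise DeBERTa v1): no token ever starts
# with "##", so the collator treats every token as its own word and the
# masking silently degenerates to ordinary token-level MLM.
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")
print(roberta_tok.tokenize("antidisestablishmentarianism"))
# e.g. ['ant', 'idis', 'establishment', ...] -- no "##" anywhere
```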
Expected behavior
A data collator that ships with the library should work for any model, not only for BERT. Or, at the very least, the docstring should make it clear that one will waste huge amounts of money by using this collator with models other than BERT.

That said, I would like to know how I could use the word ids from the tokenizer to do this, as in the token classification example you provide here: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb#scrollTo=MOsHUjgdIrIW In that example, extending labels to continuation tokens does not depend on the token starting with "##", but uses the word ids from the fast tokenizer. I think the DataCollatorForWholeWordMask should work generally, at least for all fast tokenizers, not only for BERT. In my case, I would like to know what I can do to train at least a little longer with the correct objective, i.e. whole word masking rather than plain MLM (a possible approach is sketched below).
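As a rough illustration (this is not an existing transformers API; the helper name and the 15% probability are arbitrary choices), the word grouping itself only needs word_ids() from a fast tokenizer:

```python
import random
import torch

def whole_word_mask(input_ids, word_ids, tokenizer, mlm_probability=0.15):
    """Hypothetical helper: mask whole words given word_ids() of one example.

    Assumes a fast (Rust-backed) tokenizer, since only a BatchEncoding from a
    fast tokenizer exposes word_ids().
    """
    labels = torch.full((len(input_ids),), -100, dtype=torch.long)

    # Group token positions by the word they belong to (None = special token).
    word_to_positions = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            word_to_positions.setdefault(wid, []).append(pos)

    # Sample whole words, then mask every sub-token of each sampled word.
    for positions in word_to_positions.values():
        if random.random() < mlm_probability:
            for pos in positions:
                labels[pos] = input_ids[pos]
                input_ids[pos] = tokenizer.mask_token_id

    return input_ids, labels
```

The 80/10/10 replacement scheme of the regular MLM collator is left out to keep the sketch short; the point is only that nothing here depends on the "##" convention.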
Top GitHub Comments
The problem is that after passing through datasets, the objects are dicts, not BatchEncoding, so they don’t have the word_ids() method, and without that we cannot generalize whole word masking. One solution is to pre-tokenize and pre-process the dataset inside the function you pass to the datasets map, but that disables dynamic masking, which is a key improvement of RoBERTa over BERT.
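One possible middle ground (a sketch only; the extra "word_ids" column and the checkpoint name are arbitrary choices) is to tokenize inside map but store the word ids as a plain column, so that a custom collator can still recover the word boundaries and keep the masking itself dynamic:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # any fast tokenizer

def tokenize_function(examples):
    # Tokenize here, while we still have a BatchEncoding and its word_ids().
    encoding = tokenizer(examples["text"], truncation=True, max_length=512)
    # Store the word ids as an ordinary column (None marks special tokens),
    # since datasets will later hand the collator plain dicts.
    encoding["word_ids"] = [
        encoding.word_ids(i) for i in range(len(encoding["input_ids"]))
    ]
    return encoding

# dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
```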
Thank you for elaborating!
Similarly to the implementation for BERT tokenizers in the current DataCollatorForWholeWordMask, it is possible to obtain a word-start mask for RoBERTa tokenizers by decoding every token in the collator, using something like this:
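A sketch of that idea, assuming a RoBERTa-style byte-level BPE tokenizer (the special-token handling is an added assumption rather than part of the original comment):

```python
def word_start_mask(tokenizer, input_ids):
    # Decode each token individually; with a byte-level BPE tokenizer, tokens
    # that begin a new word decode with a leading space.
    special_ids = set(tokenizer.all_special_ids)
    return [
        t not in special_ids and tokenizer.decode([t]).startswith(" ")
        for t in input_ids
    ]
```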
I believe this is accurate if the tokenizer is initialized with add_prefix_space=True; otherwise the first word is missed, which is probably acceptable in most circumstances. If this method is correct, it could be extended to BERT tokenizers as well, where the condition for the first token of a word is not tokenizer.decode([t]).startswith('##'). I’m not sure whether this is a path one wants to take here, though.