DataCollatorForWholeWordMask only works for BERT, and the docstring does not mention this.

Environment info

  • transformers version:
  • Platform:
  • Python version:
  • PyTorch version (GPU?):
  • Tensorflow version (GPU?):
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help

@patrickvonplaten @LysandreJik @patil-suraj @sgugger

Information

Model I am using (Bert, XLNet …): DeBERTa (v1) base

The problem arises when using:

  • the official example scripts: (give details below): DataCollatorForWholeWordMask
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

The DataCollatorForWholeWordMask, which should be usable for pre-training a RoBERTa or DeBERTa model, for example (since there is no span collator), only works for BERT, and one has to read the collator code to notice this. I have been training a language model from scratch for weeks, only to notice yesterday that the whole word masking collator is wrong for anything other than BERT.

Steps to reproduce the behavior:

  1. Try to use the DataCollatorForWholeWordMask with any model that is not BERT (see the tokenizer comparison sketch below).
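
To see why this goes wrong, here is a minimal comparison sketch (not from the original issue; it assumes transformers is installed and uses the standard bert-base-uncased and roberta-base checkpoints). The collator detects sub-word continuations via the WordPiece "##" prefix, which byte-level BPE tokenizers such as RoBERTa's or DeBERTa's never produce:

# Illustrative sketch: compare how the two tokenizer families mark sub-words.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")

text = "Whole word masking should mask entire words"
# WordPiece (BERT) marks continuation pieces with a leading "##".
print(bert_tok.tokenize(text))
# Byte-level BPE (RoBERTa, DeBERTa, GPT-2, ...) marks word starts with "Ġ"
# (an encoded leading space) and never emits "##".
print(roberta_tok.tokenize(text))

# DataCollatorForWholeWordMask groups tokens into words by looking for "##",
# so with a RoBERTa/DeBERTa tokenizer every token is treated as its own word
# and the masking silently degrades to ordinary token-level MLM.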

Expected behavior

A data collator that ships with the library should work for any model, not only for BERT; or at least the docstring should make it clear that using this collator with models other than BERT will waste a huge amount of compute and money.

That said, I would like to know how I could use the word_ids from the tokenizer to do this, as in the TokenClassification example you provide here: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb#scrollTo=MOsHUjgdIrIW In that example, the extension of the labels to sub-word tokens does not depend on the continuation token starting with "##"; it uses the word ids from the fast tokenizer instead. I think DataCollatorForWholeWordMask should work generally, at least for all fast tokenizers, not only for BERT. In my case, I would like to know what I can do to continue training, at least for a while, with the correct objective: whole word masking rather than plain MLM.
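
To make the word_ids-based approach concrete, here is a minimal sketch (not part of the library) of whole word masking driven by a fast tokenizer's word_ids(); the helper name build_whole_word_mask and the simplified all-[MASK] replacement (instead of the usual 80/10/10 scheme) are illustrative assumptions:

import random
import torch

def build_whole_word_mask(word_ids, mlm_probability=0.15):
    # word_ids: list with one entry per token, holding the index of the word the
    # token came from, or None for special tokens (output of encoding.word_ids()).
    words = {}
    for position, word_id in enumerate(word_ids):
        if word_id is not None:
            words.setdefault(word_id, []).append(position)

    # Select whole words, then mark every token position belonging to them.
    mask = [False] * len(word_ids)
    for positions in words.values():
        if random.random() < mlm_probability:
            for position in positions:
                mask[position] = True
    return torch.tensor(mask)

# Usage with any fast tokenizer (e.g. RoBERTa or DeBERTa):
# encoding = tokenizer("whole word masking example", return_tensors="pt")
# masked = build_whole_word_mask(encoding.word_ids())
# inputs, labels = encoding["input_ids"].clone(), encoding["input_ids"].clone()
# labels[0, ~masked] = -100                    # loss only on masked positions
# inputs[0, masked] = tokenizer.mask_token_id  # simplified: always use [MASK]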

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
alexvaca0 commented, Jun 25, 2021

The problem is that after passing through datasets, the objects are plain dicts, not BatchEncoding, so they do not have the word_ids() method, and without that we cannot generalize whole word masking. One solution is to pre-tokenize and pre-process the dataset inside the function passed to datasets' map, but that disables dynamic masking, which is a key improvement of RoBERTa over BERT.
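
One middle ground, sketched below under the assumption that a fast tokenizer and the datasets library are used (the function and column names are illustrative), is to materialize the word ids as an ordinary column during map, so a custom collator can still pick a fresh set of words to mask at batch time:

def tokenize_with_word_ids(examples, tokenizer, max_length=512):
    # Tokenize a batch of raw texts; truncation keeps sequences to a fixed max length.
    encodings = tokenizer(examples["text"], truncation=True, max_length=max_length)
    # word_ids() only exists on the BatchEncoding, so store it as a plain column;
    # None (special tokens) is encoded as -1 to keep the column a list of ints.
    encodings["word_ids"] = [
        [-1 if wid is None else wid for wid in encodings.word_ids(i)]
        for i in range(len(encodings["input_ids"]))
    ]
    return encodings

# tokenized = raw_dataset.map(
#     lambda batch: tokenize_with_word_ids(batch, tokenizer),
#     batched=True, remove_columns=["text"])
# A custom whole-word-masking collator can then read example["word_ids"] and
# choose different words to mask every epoch, keeping dynamic masking.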

0 reactions
ionicsolutions commented, Jun 25, 2021

Thank you for elaborating!

Similarly to the BERT-specific implementation in the current DataCollatorForWholeWordMask, it is possible to obtain a word-start mask for RoBERTa tokenizers by decoding every token inside the collator, using something like this:

def _word_starts(self, inputs: torch.Tensor) -> torch.Tensor:
    # One entry per token position; padding positions stay False/0.
    is_word_start = torch.full_like(inputs, fill_value=False)
    # Walk the batch one example (row) at a time.
    for i, example in enumerate(torch.split(inputs, split_size_or_sections=1, dim=0)):
        # A byte-level BPE token starts a new word iff it decodes with a leading space.
        line_mask = torch.tensor([self.tokenizer.decode([t]).startswith(" ") for t in example.flatten().tolist()
                                  if t != self.tokenizer.pad_token_id])
        is_word_start[i, 0:line_mask.shape[0]] = line_mask
    return is_word_start

I believe that this is accurate if the tokenizer is initialized with add_prefix_space=True; otherwise the first word of each sequence is missed, which is probably acceptable in most circumstances.

If this method is correct, it could be extended to BERT-style (WordPiece) tokenizers, where the word-start condition becomes not tokenizer.decode([t]).startswith('##') instead of the leading-space check. I'm not sure whether this is a path one wants to take here, though.
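
For illustration, that WordPiece variant would presumably amount to changing only the line_mask expression in the method above; a hedged sketch, not taken from the original thread:

# Hypothetical word-start test for WordPiece (BERT-style) tokenizers: a token
# begins a new word iff its decoded form does NOT start with the "##" marker.
line_mask = torch.tensor([not self.tokenizer.decode([t]).startswith("##")
                          for t in example.flatten().tolist()
                          if t != self.tokenizer.pad_token_id])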
