Load large text file for LM pre-training resulting in OOM

See original GitHub issue

I tried to pretrain Longformer using transformers and datasets, but I ran into out-of-memory (OOM) issues when loading a large text file. My script looks roughly like this:

import torch
from dataclasses import dataclass
from typing import Dict, List, Tuple

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer

@dataclass
class DataCollatorForDatasetsLanguageModeling(DataCollatorForLanguageModeling):
    """
    Data collator used for language modeling based on DataCollatorForLazyLanguageModeling
    - collates batches of tensors, honoring their tokenizer's pad_token
    - preprocesses batches for masked language modeling
    """

    block_size: int = 512

    def __call__(self, examples: List[dict]) -> Dict[str, torch.Tensor]:
        examples = [example['text'] for example in examples]
        batch, attention_mask = self._tensorize_batch(examples)
        if self.mlm:
            inputs, labels = self.mask_tokens(batch)
            return {"input_ids": inputs, "labels": labels}
        else:
            labels = batch.clone().detach()
            if self.tokenizer.pad_token_id is not None:
                labels[labels == self.tokenizer.pad_token_id] = -100
            return {"input_ids": batch, "labels": labels}

    def _tensorize_batch(self, examples: List[str]) -> Tuple[torch.Tensor, torch.Tensor]:

        if self.tokenizer._pad_token is None:
            raise ValueError(
                "You are attempting to pad samples but the tokenizer you are using"
                f" ({self.tokenizer.__class__.__name__}) does not have one."
            )

        tensor_examples = self.tokenizer.batch_encode_plus(
            [ex for ex in examples if ex],
            max_length=self.block_size,
            return_tensors="pt",
            pad_to_max_length=True,
            return_attention_mask=True,
            truncation=True,
        )

        input_ids, attention_mask = tensor_examples["input_ids"], tensor_examples["attention_mask"]
        return input_ids, attention_mask

# tokenizer, model, args, and model_path are defined elsewhere in the full script
dataset = load_dataset('text', data_files='train.txt', cache_dir='./', split='train')
data_collator = DataCollatorForDatasetsLanguageModeling(tokenizer=tokenizer, mlm=True,
                      mlm_probability=0.15, block_size=tokenizer.max_len)
trainer = Trainer(model=model, args=args, data_collator=data_collator,
                      train_dataset=dataset, prediction_loss_only=True)
trainer.train(model_path=model_path)

This train.txt is about 1.1 GB and has 90k lines, where each line is a sequence of roughly 4k words. During training, memory usage grew quickly, as shown in the graph below, and the process ran out of memory before training finished.

[Figure: memory usage graph showing RAM steadily increasing during training]

Could you please give me any suggestions on why this happened and how to fix it? Thanks.
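
One commonly suggested way to reduce memory pressure in this kind of setup (a sketch under assumptions, not the fix proposed in this thread) is to tokenize the dataset once with datasets.Dataset.map, so the tokenized columns are written to the on-disk Arrow cache and memory-mapped, and then hand batches to the stock DataCollatorForLanguageModeling instead of tokenizing inside a custom collator. The tokenize_function name and block_size value below are illustrative.

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

block_size = 512  # illustrative; match your model's maximum sequence length

def tokenize_function(batch):
    # Tokenize a batch of raw lines with fixed-length padding and truncation.
    return tokenizer(
        batch["text"],
        max_length=block_size,
        padding="max_length",
        truncation=True,
    )

raw_dataset = load_dataset("text", data_files="train.txt", cache_dir="./", split="train")

# A batched map writes the tokenized columns to the Arrow cache on disk,
# so the full tokenized corpus does not have to sit in RAM at once.
tokenized_dataset = raw_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
)

# The stock collator then only applies MLM masking at batch time.
# Depending on the transformers version, you may also need
# tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask"]).
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

The trade-off is extra disk space for the cached tokenized dataset in exchange for a roughly constant memory footprint during training.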

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 27 (9 by maintainers)

Top GitHub Comments

2 reactions
gaceladri commented, Feb 15, 2021

@lhoestq Sure, here you have https://colab.research.google.com/drive/1ba09ZOpyHGAOQLcsxiQAHRXl10qnMU5o?usp=sharing. Let me know if the link works and whether it reproduces the issue. It does for me: once training starts, the RAM usage keeps increasing.

Let me know. Thanks!

1 reaction
gaceladri commented, Feb 15, 2021

@lhoestq could be, but if we set wandb to false this should not happen. I am going to try.
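
For reference, here is a minimal sketch of how wandb reporting can be switched off to rule it out. The WANDB_DISABLED environment variable and the report_to argument are standard transformers mechanisms rather than something quoted from this thread, and report_to requires a reasonably recent transformers version; the output directory below is illustrative.

import os

# Option 1: disable the wandb integration globally before creating the Trainer.
os.environ["WANDB_DISABLED"] = "true"

# Option 2: tell the Trainer not to report to any experiment tracker.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",   # illustrative output directory
    report_to=[],         # no wandb, no tensorboard
)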
