Load large text file for LM pre-training resulting in OOM

See original GitHub issue

I tried to pretrain Longformer using transformers and datasets, but I ran into out-of-memory (OOM) issues when loading a large text file. My script looks roughly like this:

import torch
from dataclasses import dataclass
from typing import Dict, List, Tuple

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer

@dataclass
class DataCollatorForDatasetsLanguageModeling(DataCollatorForLanguageModeling):
    """
    Data collator used for language modeling based on DataCollatorForLazyLanguageModeling
    - collates batches of tensors, honoring their tokenizer's pad_token
    - preprocesses batches for masked language modeling
    """

    block_size: int = 512

    def __call__(self, examples: List[dict]) -> Dict[str, torch.Tensor]:
        examples = [example['text'] for example in examples]
        batch, attention_mask = self._tensorize_batch(examples)
        if self.mlm:
            inputs, labels = self.mask_tokens(batch)
            return {"input_ids": inputs, "labels": labels}
        else:
            labels = batch.clone().detach()
            if self.tokenizer.pad_token_id is not None:
                labels[labels == self.tokenizer.pad_token_id] = -100
            return {"input_ids": batch, "labels": labels}

    def _tensorize_batch(self, examples: List[str]) -> Tuple[torch.Tensor, torch.Tensor]:

        if self.tokenizer._pad_token is None:
            raise ValueError(
                "You are attempting to pad samples but the tokenizer you are using"
                f" ({self.tokenizer.__class__.__name__}) does not have one."
            )

        tensor_examples = self.tokenizer.batch_encode_plus(
            [ex for ex in examples if ex],
            max_length=self.block_size,
            return_tensors="pt",
            pad_to_max_length=True,
            return_attention_mask=True,
            truncation=True,
        )

        input_ids, attention_mask = tensor_examples["input_ids"], tensor_examples["attention_mask"]
        return input_ids, attention_mask

# tokenizer, model, args, and model_path are defined elsewhere in the full script
dataset = load_dataset('text', data_files='train.txt', cache_dir='./', split='train')
data_collator = DataCollatorForDatasetsLanguageModeling(tokenizer=tokenizer, mlm=True,
                      mlm_probability=0.15, block_size=tokenizer.max_len)
trainer = Trainer(model=model, args=args, data_collator=data_collator,
                      train_dataset=dataset, prediction_loss_only=True)
trainer.train(model_path=model_path)

This train.txt is about 1.1 GB and has 90k lines, where each line is a sequence of roughly 4k words. During training, memory usage grew quickly, as shown in the graph below, and the process ran out of memory before training finished.

[Figure: memory usage graph showing RAM steadily increasing during training]

Could you please give me any suggestions on why this happened and how to fix it? Thanks.
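
One commonly suggested way to reduce memory pressure in this kind of setup (a sketch under assumptions, not the fix proposed in this thread) is to tokenize the dataset once with datasets.Dataset.map, so the tokenized columns are written to the on-disk Arrow cache and memory-mapped, and then hand batches to the stock DataCollatorForLanguageModeling instead of tokenizing inside a custom collator. The tokenize_function name and block_size value below are illustrative.

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

block_size = 512  # illustrative; match your model's maximum sequence length

def tokenize_function(batch):
    # Tokenize a batch of raw lines with fixed-length padding and truncation.
    return tokenizer(
        batch["text"],
        max_length=block_size,
        padding="max_length",
        truncation=True,
    )

raw_dataset = load_dataset("text", data_files="train.txt", cache_dir="./", split="train")

# A batched map writes the tokenized columns to the Arrow cache on disk,
# so the full tokenized corpus does not have to sit in RAM at once.
tokenized_dataset = raw_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
)

# The stock collator then only applies MLM masking at batch time.
# Depending on the transformers version, you may also need
# tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask"]).
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

The trade-off is extra disk space for the cached tokenized dataset in exchange for a roughly constant memory footprint during training.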

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 27 (9 by maintainers)

Top GitHub Comments

2 reactions
gaceladri commented, Feb 15, 2021

@lhoestq Sure, here you have https://colab.research.google.com/drive/1ba09ZOpyHGAOQLcsxiQAHRXl10qnMU5o?usp=sharing. Let me know if the link works and whether it reproduces the issue. It does for me: once training starts, the RAM usage keeps increasing.

Let me know. Thanks!

1 reaction
gaceladri commented, Feb 15, 2021

@lhoestq could be, but if we set wandb to false this should not happen. I am going to try.
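
For reference, here is a minimal sketch of how wandb reporting can be switched off to rule it out. The WANDB_DISABLED environment variable and the report_to argument are standard transformers mechanisms rather than something quoted from this thread, and report_to requires a reasonably recent transformers version; the output directory below is illustrative.

import os

# Option 1: disable the wandb integration globally before creating the Trainer.
os.environ["WANDB_DISABLED"] = "true"

# Option 2: tell the Trainer not to report to any experiment tracker.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",   # illustrative output directory
    report_to=[],         # no wandb, no tensorboard
)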
