Load large text file for LM pre-training resulting in OOM
I tried to pretrain Longformer using transformers and datasets, but I got OOM issues while loading a large text file. My script is roughly as follows:
from dataclasses import dataclass
from typing import Dict, List, Tuple

import torch
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer

# tokenizer, model, args and model_path are defined elsewhere in the full script.

@dataclass
class DataCollatorForDatasetsLanguageModeling(DataCollatorForLanguageModeling):
    """
    Data collator used for language modeling based on DataCollatorForLazyLanguageModeling
    - collates batches of tensors, honoring their tokenizer's pad_token
    - preprocesses batches for masked language modeling
    """
    block_size: int = 512

    def __call__(self, examples: List[dict]) -> Dict[str, torch.Tensor]:
        examples = [example["text"] for example in examples]
        batch, attention_mask = self._tensorize_batch(examples)
        if self.mlm:
            inputs, labels = self.mask_tokens(batch)
            return {"input_ids": inputs, "labels": labels}
        else:
            labels = batch.clone().detach()
            if self.tokenizer.pad_token_id is not None:
                labels[labels == self.tokenizer.pad_token_id] = -100
            return {"input_ids": batch, "labels": labels}

    def _tensorize_batch(self, examples: List[str]) -> Tuple[torch.Tensor, torch.Tensor]:
        if self.tokenizer._pad_token is None:
            raise ValueError(
                "You are attempting to pad samples but the tokenizer you are using"
                f" ({self.tokenizer.__class__.__name__}) does not have one."
            )
        tensor_examples = self.tokenizer.batch_encode_plus(
            [ex for ex in examples if ex],
            max_length=self.block_size,
            return_tensors="pt",
            pad_to_max_length=True,
            return_attention_mask=True,
            truncation=True,
        )
        input_ids, attention_mask = tensor_examples["input_ids"], tensor_examples["attention_mask"]
        return input_ids, attention_mask

dataset = load_dataset("text", data_files="train.txt", cache_dir="./", split="train")
data_collator = DataCollatorForDatasetsLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15, block_size=tokenizer.max_len
)
trainer = Trainer(
    model=model, args=args, data_collator=data_collator,
    train_dataset=dataset, prediction_loss_only=True,
)
trainer.train(model_path=model_path)
This train.txt is about 1.1 GB and has 90k lines, where each line is a sequence of roughly 4k words. During training, memory usage grew quickly, as shown in the memory-usage graph (not reproduced here), and the job hit OOM before training finished.
Could you please give me any suggestions on why this happened and how to fix it? Thanks.
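One workaround that is often suggested for this kind of setup (a sketch under assumptions, not taken from this thread) is to tokenize the corpus once with dataset.map instead of tokenizing inside the collator, so the tokenized data lives in the memory-mapped Arrow cache on disk rather than being re-encoded for every batch. The sketch below assumes tokenizer and a block_size value are already defined as in the script above; tokenize_function and tokenized are illustrative names:

from datasets import load_dataset

dataset = load_dataset("text", data_files="train.txt", cache_dir="./", split="train")

def tokenize_function(batch):
    # Tokenize a batch of lines, padding/truncating to block_size up front.
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=block_size,
    )

tokenized = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],  # drop the raw text column once it is tokenized
)

# Only hand torch tensors of the needed columns to the training loop.
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask"])

With the text tokenized up front, the collator only has to apply MLM masking; depending on the transformers version, the stock DataCollatorForLanguageModeling may accept these examples directly, or it may need a thin wrapper that extracts input_ids.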
@lhoestq sure, here you have https://colab.research.google.com/drive/1ba09ZOpyHGAOQLcsxiQAHRXl10qnMU5o?usp=sharing. Let me know if the link works and whether it reproduces the issue. For me it does: once training starts, the RAM usage keeps increasing.
Let me know. Thanks!
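A quick way to narrow down where the growth comes from (a diagnostic sketch, not taken from this thread) is to iterate over batches outside the Trainer and watch the process RSS; if memory stays flat here, the growth is more likely coming from the training or logging side than from the dataset itself. It assumes dataset and data_collator are defined as in the script above and that psutil is installed:

import psutil
from torch.utils.data import DataLoader

process = psutil.Process()
loader = DataLoader(dataset, batch_size=8, collate_fn=data_collator)

for step, batch in enumerate(loader):
    if step % 100 == 0:
        rss_mb = process.memory_info().rss / 1024 ** 2
        print(f"step {step}: RSS = {rss_mb:.0f} MiB")  # resident memory of this process
    if step >= 2000:
        break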
@lhoestq it could be, but if we disable wandb this should not happen. I am going to try.
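For reference, one way to test that hypothesis (a sketch, not taken from this thread) is to switch off the wandb integration before the Trainer is created; the exact mechanism depends on the transformers version:

import os

# Older transformers versions check this environment variable before setting up wandb.
os.environ["WANDB_DISABLED"] = "true"

# More recent versions let you turn off reporting integrations in the training arguments, e.g.:
# args = TrainingArguments(output_dir="out", report_to=[])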