Trainer makes RAM go out of memory after a while
Environment info
- transformers version: 3.4.0
- Platform: Linux-4.14.193-113.317.amzn1.x86_64-x86_64-with-glibc2.9
- Python version: 3.6.10
- PyTorch version (GPU?): 1.6.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: True
- Using distributed or parallel set-up in script?: False
Information
Model I am using: T5
The problem arises when using my own modified scripts: I load my dataset this way:
from datasets import load_dataset

# tokenizer, text_column, generated_column, dataset_file and SEED are defined earlier in the script

def tokenize(batch):
    # Tokenize inputs and targets; the tokenized targets become the labels
    tokenized_input = tokenizer(batch[text_column], padding=True, truncation=True, max_length=153)
    tokenized_label = tokenizer(batch[generated_column], padding=True, truncation=True, max_length=274)
    tokenized_input['labels'] = tokenized_label['input_ids']
    return tokenized_input

dataset = load_dataset('csv', data_files=dataset_file, split='train')
dataset = dataset.train_test_split(test_size=0.05, seed=SEED)
train_dataset = dataset['train']
val_dataset = dataset['test']

# Tokenize each split in a single batch so padding is applied across the whole split
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
val_dataset = val_dataset.map(tokenize, batched=True, batch_size=len(val_dataset))

train_dataset.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])
And then I use Trainer to train my T5 model like this:
from transformers import Trainer, TrainingArguments

# output_dir, now, and model are defined earlier in the script
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_accumulation_steps=1,
    learning_rate=0.001,
    evaluation_strategy='steps',
    save_steps=1000000,
    save_total_limit=1,
    remove_unused_columns=True,
    run_name=now,
    logging_steps=100,
    eval_steps=100,
    logging_first_step=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()
The task I am working on uses my own dataset: a custom machine-translation dataset that is about 12 MB in size and contains 18,000 examples. The maximum sequence lengths are 153 tokens for the input and 274 tokens for the output. I have also added 68 special tokens to the tokenizer, as the dataset contains many symbols.
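For context, adding special tokens and resizing the model embeddings is typically done as in the sketch below. The checkpoint name and token strings here are placeholders, not the actual 68 symbols from the dataset:

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained('t5-base')
    model = T5ForConditionalGeneration.from_pretrained('t5-base')

    # Placeholders standing in for the 68 dataset-specific symbols
    extra_symbols = ['<sym_0>', '<sym_1>', '<sym_2>']
    tokenizer.add_special_tokens({'additional_special_tokens': extra_symbols})

    # Grow the embedding matrix so the new token ids have embeddings
    model.resize_token_embeddings(len(tokenizer))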
To reproduce
Steps to reproduce the behavior:
- Load a dataset like I did.
- Start training using Trainer
- During every evaluation, RAM usage grows and is not freed, so each subsequent evaluation step accumulates more RAM, until the maximum is reached and training stops with this error:
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 281882432 bytes. Error code 12 (Cannot allocate memory)
(The machine I am using has 60 GB of RAM.)
Expected behavior
The RAM used for evaluation should be freed after every step. It looks like something accumulates during training and the memory is never released. I see the same behavior if I skip training and only run evaluation: after many evaluation steps the RAM blows up.
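For anyone trying to reproduce this, the growth can be observed by logging the process RSS between evaluation calls. A minimal sketch, assuming psutil is installed and trainer is the Trainer instance built above:

    import gc
    import psutil

    process = psutil.Process()

    for step in range(10):
        trainer.evaluate()   # run one full evaluation pass
        gc.collect()         # force a collection so only truly retained memory is counted
        rss_gb = process.memory_info().rss / 1024 ** 3
        print(f'evaluation {step}: RSS = {rss_gb:.2f} GB')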
Maintainer comment:
During evaluation, we need to store the predictions and labels too, for the metric computation. If you want to store the loss only, pass the flag prediction_loss_only=True to your training arguments, which will use much less RAM (and you can then probably remove eval_accumulation_steps=1 to speed up evaluation). This should even be automatic now, as I just merged a PR on master where the Trainer does not bother saving the predictions when there is no compute_metrics (which is your case here).
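Concretely, the suggested change to the training arguments would look something like this. This is a sketch of the original setup above with only the memory-related flags changed:

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        prediction_loss_only=True,   # only keep the loss during evaluation, not the full predictions
        learning_rate=0.001,
        evaluation_strategy='steps',
        save_steps=1000000,
        save_total_limit=1,
        remove_unused_columns=True,
        run_name=now,
        logging_steps=100,
        eval_steps=100,
        logging_first_step=True,
    )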