Trainer makes RAM go out of memory after a while
Environment info
- transformers version: 3.4.0
- Platform: Linux-4.14.193-113.317.amzn1.x86_64-x86_64-with-glibc2.9
- Python version: 3.6.10
- PyTorch version (GPU?): 1.6.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: True
- Using distributed or parallel set-up in script?: False
Information
Model I am using: T5
The problem arises when using my own modified scripts: I load my dataset this way:
from datasets import load_dataset

# tokenizer, text_column, generated_column, dataset_file and SEED are defined earlier in the script

def tokenize(batch):
    # Tokenize inputs and targets; the tokenized targets become the labels
    tokenized_input = tokenizer(batch[text_column], padding=True, truncation=True, max_length=153)
    tokenized_label = tokenizer(batch[generated_column], padding=True, truncation=True, max_length=274)
    tokenized_input['labels'] = tokenized_label['input_ids']
    return tokenized_input

dataset = load_dataset('csv', data_files=dataset_file, split='train')
dataset = dataset.train_test_split(test_size=0.05, seed=SEED)
train_dataset = dataset['train']
val_dataset = dataset['test']

# Tokenize each split in a single batch so padding is applied across the whole split
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
val_dataset = val_dataset.map(tokenize, batched=True, batch_size=len(val_dataset))

train_dataset.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])
And then I use Trainer to train my T5 model like this:
from transformers import Trainer, TrainingArguments

# output_dir, now, and model are defined earlier in the script
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_accumulation_steps=1,
    learning_rate=0.001,
    evaluation_strategy='steps',
    save_steps=1000000,
    save_total_limit=1,
    remove_unused_columns=True,
    run_name=now,
    logging_steps=100,
    eval_steps=100,
    logging_first_step=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()
The task I am working on uses my own dataset: a custom machine-translation dataset that is about 12 MB in size and contains 18,000 examples. The maximum sequence lengths are 153 tokens for the input and 274 tokens for the output. I have also added 68 special tokens to the tokenizer, as the dataset contains many symbols.
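For context, adding special tokens and resizing the model embeddings is typically done as in the sketch below. The checkpoint name and token strings here are placeholders, not the actual 68 symbols from the dataset:

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained('t5-base')
    model = T5ForConditionalGeneration.from_pretrained('t5-base')

    # Placeholders standing in for the 68 dataset-specific symbols
    extra_symbols = ['<sym_0>', '<sym_1>', '<sym_2>']
    tokenizer.add_special_tokens({'additional_special_tokens': extra_symbols})

    # Grow the embedding matrix so the new token ids have embeddings
    model.resize_token_embeddings(len(tokenizer))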
To reproduce
Steps to reproduce the behavior:
- Load a dataset like I did.
- Start training using Trainer
- During every evaluation, RAM usage grows and is not freed, so each subsequent evaluation step accumulates more RAM, until the maximum is reached and training stops with this error:
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 281882432 bytes. Error code 12 (Cannot allocate memory)
(The machine I am using has 60 GB of RAM.)
Expected behavior
The RAM used for evaluation should be freed after every step. It looks like something accumulates during training and the memory is never released. I see the same behavior if I skip training and only run evaluation: after many evaluation steps the RAM blows up.
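For anyone trying to reproduce this, the growth can be observed by logging the process RSS between evaluation calls. A minimal sketch, assuming psutil is installed and trainer is the Trainer instance built above:

    import gc
    import psutil

    process = psutil.Process()

    for step in range(10):
        trainer.evaluate()   # run one full evaluation pass
        gc.collect()         # force a collection so only truly retained memory is counted
        rss_gb = process.memory_info().rss / 1024 ** 3
        print(f'evaluation {step}: RSS = {rss_gb:.2f} GB')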
Maintainer comment:
During evaluation, we need to store the predictions and labels too, for the metric computation. If you want to store the loss only, pass the flag prediction_loss_only=True to your training arguments, which will use much less RAM (and you can then probably remove eval_accumulation_steps=1 to speed up evaluation). This should even be automatic now, as I just merged a PR on master where the Trainer does not bother saving the predictions when there is no compute_metrics (which is your case here).
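Concretely, the suggested change to the training arguments would look something like this. This is a sketch of the original setup above with only the memory-related flags changed:

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        prediction_loss_only=True,   # only keep the loss during evaluation, not the full predictions
        learning_rate=0.001,
        evaluation_strategy='steps',
        save_steps=1000000,
        save_total_limit=1,
        remove_unused_columns=True,
        run_name=now,
        logging_steps=100,
        eval_steps=100,
        logging_first_step=True,
    )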