
Trainer runs out of RAM after a while

See original GitHub issue

Environment info

  • transformers version: 3.4.0
  • Platform: Linux-4.14.193-113.317.amzn1.x86_64-x86_64-with-glibc2.9
  • Python version: 3.6.10
  • PyTorch version (GPU?): 1.6.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: True
  • Using distributed or parallel set-up in script?: False

Who can help

@sgugger @patrickvonplaten

Information

Model I am using: T5

The problem arises when using my own modified scripts: I load my dataset this way:

from datasets import load_dataset

# tokenizer, text_column, generated_column, dataset_file and SEED are defined
# earlier in the script.
def tokenize(batch):
    tokenized_input = tokenizer(batch[text_column], padding=True, truncation=True, max_length=153)
    tokenized_label = tokenizer(batch[generated_column], padding=True, truncation=True, max_length=274)

    tokenized_input['labels'] = tokenized_label['input_ids']

    return tokenized_input

dataset = load_dataset('csv', data_files=dataset_file, split='train')
dataset = dataset.train_test_split(test_size=0.05, seed=SEED)
train_dataset = dataset['train']
val_dataset = dataset['test']

# Tokenize each split as a single batch so padding is uniform within the split.
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
val_dataset = val_dataset.map(tokenize, batched=True, batch_size=len(val_dataset))
train_dataset.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset.set_format('numpy', columns=['input_ids', 'attention_mask', 'labels'])
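
A quick, purely illustrative sanity check on the resulting splits (these lines are an editorial assumption, not part of the original script; the printed values depend on the CSV):

print(train_dataset.column_names)
print(train_dataset[0]['input_ids'].shape)   # 1-D numpy array, length <= 153
print(train_dataset[0]['labels'].shape)      # 1-D numpy array, length <= 274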

And then I use Trainer to train my T5 model like this:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    eval_accumulation_steps=1,
    learning_rate=0.001,
    evaluation_strategy='steps',
    save_steps=1000000,
    save_total_limit=1,
    remove_unused_columns=True,
    run_name=now,
    logging_steps=100,
    eval_steps=100,
    logging_first_step=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()

The task I am working on uses my own dataset: a custom machine-translation dataset of about 12 MB with 18,000 examples. The maximum sequence lengths are 153 tokens for the input and 274 for the output. I have also added 68 special tokens, since the dataset contains many symbols.
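
For context, extra symbols are normally registered on the tokenizer and the embedding matrix resized to match. A minimal sketch of that step using standard transformers calls; the checkpoint name and symbol list below are placeholders, not taken from the issue:

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

# Placeholder list standing in for the 68 dataset-specific symbols.
extra_symbols = ['<sym_1>', '<sym_2>']
tokenizer.add_tokens(extra_symbols)

# Grow the embedding matrix so the new vocabulary entries have vectors.
model.resize_token_embeddings(len(tokenizer))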

To reproduce

Steps to reproduce the behavior:

  1. Load a dataset like I did.
  2. Start training using Trainer
  3. During every evaluation, RAM usage grows and is not freed, so each subsequent evaluation step accumulates more RAM, until the maximum is reached and training stops with this error: RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 281882432 bytes. Error code 12 (Cannot allocate memory). (The machine I am using has 60 GB of RAM.)

Expected behavior

The RAM used for evaluation should be freed after every step. It looks like something accumulates during training and the memory is never released. I get the same behavior if I skip training and only run evaluation: after many evaluation steps the RAM blows up.
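
One rough way to observe this without a full training run is to call evaluation in a loop and log the process RSS between rounds. A sketch, assuming the trainer built above and psutil installed (not part of the original report):

import psutil

process = psutil.Process()
for step in range(10):
    trainer.evaluate()
    rss_gb = process.memory_info().rss / 1024 ** 3
    print(f'eval round {step}: RSS = {rss_gb:.2f} GB')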

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

4 reactions
sgugger commented, Oct 29, 2020

During evaluation, we need to store predictions and labels too, for the metric computation. If you want to store the loss only, then pass along the flag prediction_loss_only=True to your training arguments, which will use much less RAM (and you can then probably remove the eval_accumulation_steps=1 to speed up evaluation).
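
Applied to the arguments from the report above, the suggestion would look roughly like this (an editorial sketch, not code from the thread; only the two flagged settings change):

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    prediction_loss_only=True,   # keep only the loss during evaluation
    # eval_accumulation_steps=1 dropped, as suggested, to speed evaluation back up
    learning_rate=0.001,
    evaluation_strategy='steps',
    save_steps=1000000,
    save_total_limit=1,
    remove_unused_columns=True,
    run_name=now,
    logging_steps=100,
    eval_steps=100,
    logging_first_step=True
)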

1 reaction
sgugger commented, Oct 29, 2020

Should even be automatic now as I just merged a PR on master where the Trainer does not bother saving the predictions when there is no compute_metrics (which is your case here).

