
Environment info

  • transformers version: 3.1.0
  • Platform: Linux-3.10.0-1127.el7.x86_64-x86_64-with-debian-buster-sid
  • Python version: 3.7.0
  • PyTorch version (GPU?): 1.5.1 (True)
  • Tensorflow version (GPU?): 2.2.0 (False)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: distributed

Who can help

@LysandreJik, @sgugger, @patrickvonplaten

Information

Model I am using (Bert, GPT2):

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

When I pretrain or fine-tune a model (in my case BERT and GPT2) using torch.distributed.launch, the CPU memory usage grows until it reaches the memory limit (>500 GB) and the first process is killed. With bert-base it takes around 30 epochs before the first process is killed; with gpt2-large it takes only about 3 epochs. Below is the command line I use to train/fine-tune bert-base (the GPT2 command is similar). The script run_language_modeling.py is a copy of transformers/examples/language-modeling/run_language_modeling.py (version 3.1.0).

python -m torch.distributed.launch --nproc_per_node=8 \
    …/run_language_modeling.py \
    --output_dir $model_target \
    --model_name_or_path $model_source \
    --config_name $model_source \
    --tokenizer_name $model_source \
    --train_data_file $target_train \
    --eval_data_file $target_test \
    --save_total_limit 5 \
    --block_size 128 \
    --overwrite_output_dir \
    --fp16 \
    --num_train_epochs 50 \
    --do_train --do_eval \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --mlm
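
To see whether the growth is really host-side (and how fast it is per epoch), it helps to log each worker's resident set size during training. The following is a minimal sketch, not part of the original script: it assumes psutil is installed, and log_cpu_memory is a hypothetical helper you would call once per epoch from inside run_language_modeling.py or a wrapper.

import os
import psutil  # assumption: psutil is installed (pip install psutil)


def log_cpu_memory(tag=""):
    """Print the resident set size (RSS) of the current process in GB.

    If RSS climbs steadily epoch after epoch, the problem is host memory
    (a leak or an unbounded cache), not GPU memory.
    """
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    rank = os.environ.get("RANK", "?")  # set by torch.distributed.launch for each worker
    print(f"[rank {rank}] {tag} RSS = {rss_gb:.2f} GB", flush=True)

Calling log_cpu_memory(f"epoch {epoch}") at the end of every epoch on all 8 ranks makes the per-epoch growth rate visible and shows whether every rank leaks at the same pace.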

Expected behavior

I would expect the distributed training to run to completion without any memory issues. Thanks for checking.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

2 reactions
xesdiny commented, Sep 24, 2020

1 reaction
cahya-wirawan commented, Sep 22, 2020

The size of the dataset (Indonesian Wikipedia) is around 522 MB.
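
For scale, note that with --nproc_per_node=8 each of the 8 workers builds and holds its own copy of the tokenized dataset in host memory before any leak comes into play. A rough way to measure that baseline per-process footprint is sketched below; it is only an estimate under my assumptions (the file path is a placeholder standing in for $target_train, bert-base-uncased stands in for $model_source, and, if I read the 3.1.0 script right, TextDataset is what it builds when --line_by_line is not passed).

import os

import psutil  # assumption: psutil is installed
from transformers import AutoTokenizer, TextDataset

proc = psutil.Process(os.getpid())
rss_before_gb = proc.memory_info().rss / 1024 ** 3

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in for $model_source
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="id_wiki_train.txt",  # hypothetical path standing in for $target_train
    block_size=128,                 # same value as in the command above
)

rss_after_gb = proc.memory_info().rss / 1024 ** 3
per_process = rss_after_gb - rss_before_gb
print(f"{len(dataset)} blocks, ~{per_process:.2f} GB held by one process")
print(f"~{per_process * 8:.2f} GB across 8 workers at startup, before any growth")

Whatever that baseline turns out to be, growth far beyond it over 3 to 30 epochs still points at something accumulating per step or per epoch rather than at the dataset itself.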


Top Results From Across the Web

Memory leak - Wikipedia
In computer science, a memory leak is a type of resource leak that occurs when a computer program incorrectly manages memory allocations in...

What is Memory Leak? How can we avoid? - GeeksforGeeks
Memory leak occurs when programmers create a memory in heap and forget to delete it. The consequences of memory leak is that it...

Definition of memory leak - PCMag
When memory is allocated, but not deallocated, a memory leak occurs (the memory has leaked out of the computer). If too many memory...

Memory leak - OWASP Foundation
A memory leak is an unintentional form of memory consumption whereby the developer fails to free an allocated block of memory when no...

Find a memory leak - Windows drivers - Microsoft Learn
A memory leak occurs when a process allocates memory from the paged or nonpaged pools, but doesn't free the memory.
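
Those definitions are generic, but the same pattern shows up in Python training code even with garbage collection: memory that stays reachable from a long-lived container is never returned. A classic PyTorch-flavoured version is accumulating loss instead of loss.item() across steps. The minimal sketch below only illustrates the general shape of such a leak and its fix; it is not a claim about where transformers itself leaks.

# Illustration only: memory that stays reachable from a long-lived dict is
# never freed, so the process grows a little on every call.

_history = {}  # module-level, lives as long as the process


def leaky_step(step):
    buffer = [float(i) for i in range(100_000)]  # a few MB of per-step work
    _history[step] = buffer                      # leak: every step's buffer stays reachable
    return sum(buffer) / len(buffer)


def fixed_step(step):
    buffer = [float(i) for i in range(100_000)]
    return sum(buffer) / len(buffer)             # buffer becomes unreachable and is freed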
