distributed training for transformer OOM
I'm attempting to do distributed training of a big transformer model in fp16 using the following script, and I'm hitting CUDA out-of-memory errors. I'm using a p3.16xlarge on AWS: 8 Volta V100 GPUs with 16 GB each on a single node. I know I can do the same training with a different distributed launch by spawning child processes through multiprocessing, but my end goal is to benchmark this multi-node. I don't have Slurm set up for this, so I'm following the instructions at the end of https://github.com/pytorch/fairseq/blob/master/docs/getting_started.rst and manually starting one process per GPU:
HOST_PORT="tcp://10.0.0.168:13333"

kill_children() {
  for PID in ${PIDS[*]}; do
    kill -TERM $PID
  done
}

for i in $(seq 0 7); do
  RANK=$i
  python train.py data-bin/wmt14_en_de_joined_dict \
    --arch transformer_vaswani_wmt_en_de_big \
    --share-all-embeddings \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 \
    --warmup-updates 4000 \
    --lr 0.0005 \
    --min-lr 1e-09 \
    --dropout 0.3 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 3584 --fp16 \
    --distributed-world-size 8 \
    --distributed-init-method $HOST_PORT \
    --distributed-rank $RANK &
  PIDS[$RANK]=$!
done

trap kill_children SIGTERM SIGINT

for PID in ${PIDS[*]}; do
  wait $PID
done
This is the output:
| distributed init (rank 7): tcp://10.0.0.168:13333
| distributed init (rank 1): tcp://10.0.0.168:13333
| distributed init (rank 0): tcp://10.0.0.168:13333
| distributed init (rank 4): tcp://10.0.0.168:13333
| distributed init (rank 6): tcp://10.0.0.168:13333
| distributed init (rank 5): tcp://10.0.0.168:13333
| distributed init (rank 2): tcp://10.0.0.168:13333
| distributed init (rank 3): tcp://10.0.0.168:13333
| initialized host ip-10-0-0-168 as rank 0
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_vaswani_wmt_en_de_big', attention_dropout=0.0, clip_norm=0.0, criterion='label_smoothed_cross_entropy', data='data-bin/wmt14_en_de_joined_dict', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, distributed_backend='nccl', distributed_init_method='tcp://10.0.0.168:13333', distributed_port=-1, distributed_rank=0, distributed_world_size=8, dropout=0.3, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, fp16=True, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=3584, max_update=0, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.0, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', train_subset='train', update_freq=[1], upsample_primary=1, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
| [en] dictionary: 32768 types
| [de] dictionary: 32768 types
| data-bin/wmt14_en_de_joined_dict train 4528446 examples
| data-bin/wmt14_en_de_joined_dict valid 3000 examples
| model transformer_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 209911808
| training on 8 GPUs
| max tokens per GPU = 3584 and max sentences per GPU = None
| epoch 001: 0%| | 0/5492 [00:00<?, ?it/s]THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=15 error=2 : out of memory
Traceback (most recent call last):
File "train.py", line 356, in <module>
distributed_main(args)
File "/home/ubuntu/github/fairseq/distributed_train.py", line 39, in main
single_process_main(args)
File "/home/ubuntu/github/fairseq/train.py", line 95, in main
train(args, trainer, task, epoch_itr)
File "/home/ubuntu/github/fairseq/train.py", line 133, in train
log_output = trainer.train_step(sample, update_params=True)
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 144, in train_step
agg_logging_output = self._update_params()
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 163, in _update_params
(sample_sizes, logging_outputs, ooms_fwd, ooms_bwd)
File "/home/ubuntu/github/fairseq/fairseq/distributed_utils.py", line 73, in all_gather_list
in_buffer[0] = enc_size // 255 # this encoding works for max_size < 65k
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCTensorMath.cu:15
Traceback (most recent call last):
File "train.py", line 356, in <module>
distributed_main(args)
File "/home/ubuntu/github/fairseq/distributed_train.py", line 39, in main
single_process_main(args)
File "/home/ubuntu/github/fairseq/train.py", line 83, in main
trainer.dummy_train_step(dummy_batch)
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 342, in dummy_train_step
self.train_step(dummy_batch, update_params=False, dummy_batch=True)
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 133, in train_step
loss, sample_size, logging_output, oom_fwd = self._forward(sample)
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 235, in _forward
raise e
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 227, in _forward
loss, sample_size, logging_output_ = self.task.get_loss(self.model, self.criterion, sample)
File "/home/ubuntu/github/fairseq/fairseq/tasks/fairseq_task.py", line 157, in get_loss
return criterion(model, sample)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/github/fairseq/fairseq/criterions/label_smoothed_cross_entropy.py", line 36, in forward
net_output = model(**sample['net_input'])
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/github/fairseq/fairseq/models/fairseq_model.py", line 159, in forward
encoder_out = self.encoder(src_tokens, src_lengths)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/github/fairseq/fairseq/models/transformer.py", line 290, in forward
x = layer(x, encoder_padding_mask)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/github/fairseq/fairseq/models/transformer.py", line 549, in forward
x, _ = self.self_attn(query=x, key=x, value=x, key_padding_mask=encoder_padding_mask)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/github/fairseq/fairseq/modules/multihead_attention.py", line 80, in forward
q, k, v = self.in_proj_qkv(query)
File "/home/ubuntu/github/fairseq/fairseq/modules/multihead_attention.py", line 150, in in_proj_qkv
return self._in_proj(query).chunk(3, dim=-1)
File "/home/ubuntu/github/fairseq/fairseq/modules/multihead_attention.py", line 170, in _in_proj
return F.linear(input, weight, bias)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 1026, in linear
output = input.matmul(weight.t())
RuntimeError: cublas runtime error : resource allocation failed at /pytorch/aten/src/THC/THCGeneral.cpp:333
| WARNING: ran out of memory, skipping batch
Traceback (most recent call last):
File "train.py", line 356, in <module>
distributed_main(args)
File "/home/ubuntu/github/fairseq/distributed_train.py", line 39, in main
single_process_main(args)
File "/home/ubuntu/github/fairseq/train.py", line 83, in main
trainer.dummy_train_step(dummy_batch)
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 343, in dummy_train_step
self.zero_grad()
File "/home/ubuntu/github/fairseq/fairseq/fp16_trainer.py", line 94, in zero_grad
self.optimizer.zero_grad() # FP32
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 68, in optimizer
self._build_optimizer()
File "/home/ubuntu/github/fairseq/fairseq/fp16_trainer.py", line 66, in _build_optimizer
self.fp32_params = params[0].new(0).float().new(total_param_size)
RuntimeError: CUDA error: out of memory
| epoch 001: 0%| | 0/5492 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 356, in <module>
distributed_main(args)
File "/home/ubuntu/github/fairseq/distributed_train.py", line 39, in main
single_process_main(args)
File "/home/ubuntu/github/fairseq/train.py", line 83, in main
trainer.dummy_train_step(dummy_batch)
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 343, in dummy_train_step
self.zero_grad()
File "/home/ubuntu/github/fairseq/fairseq/fp16_trainer.py", line 94, in zero_grad
self.optimizer.zero_grad() # FP32
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 68, in optimizer
self._build_optimizer()
File "/home/ubuntu/github/fairseq/fairseq/fp16_trainer.py", line 73, in _build_optimizer
self.fp32_params.grad = self.fp32_params.data.new(total_param_size)
RuntimeError: CUDA error: out of memory
Traceback (most recent call last):
File "train.py", line 356, in <module>
distributed_main(args)
File "/home/ubuntu/github/fairseq/distributed_train.py", line 39, in main
single_process_main(args)
File "/home/ubuntu/github/fairseq/train.py", line 83, in main
trainer.dummy_train_step(dummy_batch)
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 343, in dummy_train_step
self.zero_grad()
File "/home/ubuntu/github/fairseq/fairseq/fp16_trainer.py", line 94, in zero_grad
self.optimizer.zero_grad() # FP32
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 68, in optimizer
self._build_optimizer()
File "/home/ubuntu/github/fairseq/fairseq/fp16_trainer.py", line 73, in _build_optimizer
self.fp32_params.grad = self.fp32_params.data.new(total_param_size)
RuntimeError: CUDA error: out of memory
Traceback (most recent call last):
File "train.py", line 356, in <module>
distributed_main(args)
File "/home/ubuntu/github/fairseq/distributed_train.py", line 39, in main
single_process_main(args)
File "/home/ubuntu/github/fairseq/train.py", line 83, in main
trainer.dummy_train_step(dummy_batch)
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 343, in dummy_train_step
self.zero_grad()
File "/home/ubuntu/github/fairseq/fairseq/fp16_trainer.py", line 94, in zero_grad
self.optimizer.zero_grad() # FP32
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 68, in optimizer
self._build_optimizer()
File "/home/ubuntu/github/fairseq/fairseq/fp16_trainer.py", line 73, in _build_optimizer
self.fp32_params.grad = self.fp32_params.data.new(total_param_size)
RuntimeError: CUDA error: out of memory
| epoch 001: 0%| | 0/5492 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 356, in <module>
distributed_main(args)
File "/home/ubuntu/github/fairseq/distributed_train.py", line 39, in main
single_process_main(args)
File "/home/ubuntu/github/fairseq/train.py", line 95, in main
train(args, trainer, task, epoch_itr)
File "/home/ubuntu/github/fairseq/train.py", line 133, in train
log_output = trainer.train_step(sample, update_params=True)
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 144, in train_step
agg_logging_output = self._update_params()
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 163, in _update_params
(sample_sizes, logging_outputs, ooms_fwd, ooms_bwd)
File "/home/ubuntu/github/fairseq/fairseq/distributed_utils.py", line 77, in all_gather_list
torch.distributed.all_gather(out_buffers, in_buffer.cuda())
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/__init__.py", line 439, in all_gather
return all_gather_multigpu([tensor_list], [tensor], group)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/__init__.py", line 413, in all_gather_multigpu
group)
RuntimeError: Connection reset by peer
Traceback (most recent call last):
File "train.py", line 356, in <module>
distributed_main(args)
File "/home/ubuntu/github/fairseq/distributed_train.py", line 39, in main
single_process_main(args)
File "/home/ubuntu/github/fairseq/train.py", line 95, in main
train(args, trainer, task, epoch_itr)
File "/home/ubuntu/github/fairseq/train.py", line 133, in train
log_output = trainer.train_step(sample, update_params=True)
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 144, in train_step
agg_logging_output = self._update_params()
File "/home/ubuntu/github/fairseq/fairseq/trainer.py", line 163, in _update_params
(sample_sizes, logging_outputs, ooms_fwd, ooms_bwd)
File "/home/ubuntu/github/fairseq/fairseq/distributed_utils.py", line 77, in all_gather_list
torch.distributed.all_gather(out_buffers, in_buffer.cuda())
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/__init__.py", line 439, in all_gather
return all_gather_multigpu([tensor_list], [tensor], group)
File "/home/ubuntu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/__init__.py", line 413, in all_gather_multigpu
group)
RuntimeError: Connection reset by peer
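Two details in the output above hint at the failure mode: the Namespace dump shows device_id=0 even though eight ranks were launched, and the fp16 trainer's _build_optimizer allocates a flat fp32 master copy plus a matching fp32 gradient buffer for all 209,911,808 parameters. If every rank were to end up on the same GPU, the per-process optimizer state alone would exhaust a 16 GB V100. The sketch below is not from the original report; it is a rough back-of-the-envelope estimate assuming the fp32 buffers visible in the traceback and torch.optim.Adam's two fp32 moment buffers (exp_avg and exp_avg_sq):

# Rough per-process memory estimate for the reported 209,911,808 parameters.
# Assumptions: fp16 weights and gradients, the flat fp32 master copy and fp32
# gradient buffer seen in fp16_trainer._build_optimizer, and two fp32 Adam
# moment buffers. Activations and NCCL buffers are ignored.
n_params = 209_911_808
gib = 1024 ** 3

fp16_model = 2 * n_params          # fp16 weights
fp16_grads = 2 * n_params          # fp16 gradients
fp32_master = 4 * n_params         # flat fp32 master copy of the parameters
fp32_master_grads = 4 * n_params   # flat fp32 gradient buffer
adam_moments = 2 * 4 * n_params    # exp_avg + exp_avg_sq, both fp32

total = fp16_model + fp16_grads + fp32_master + fp32_master_grads + adam_moments
print(f"~{total / gib:.1f} GiB per process before activations")  # ~3.9 GiB

At roughly 3.9 GiB of parameter and optimizer state per process before any activations, eight processes sharing a single 16 GB card cannot fit, whereas one process per GPU would have ample headroom.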
Top GitHub Comments
@edunov An update: with NCCL debugging turned on, I can now see this error:
README was updated, closing for now.
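The README change referenced in the closing comment is not reproduced here. As a hedged illustration of a per-GPU launch, the sketch below replaces the shell loop with a small Python launcher that masks devices with CUDA_VISIBLE_DEVICES so each rank sees exactly one GPU. The flags mirror the original script; the device-masking approach is an assumption for illustration, not something taken from the issue or the README.

import os
import subprocess

HOST_PORT = "tcp://10.0.0.168:13333"

def launch(rank):
    # Each process sees only its own GPU; inside the process it appears as cuda:0.
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(rank)
    cmd = [
        "python", "train.py", "data-bin/wmt14_en_de_joined_dict",
        "--arch", "transformer_vaswani_wmt_en_de_big",
        "--share-all-embeddings",
        "--optimizer", "adam", "--adam-betas", "(0.9, 0.98)",
        "--clip-norm", "0.0",
        "--lr-scheduler", "inverse_sqrt",
        "--warmup-init-lr", "1e-07", "--warmup-updates", "4000",
        "--lr", "0.0005", "--min-lr", "1e-09",
        "--dropout", "0.3", "--weight-decay", "0.0",
        "--criterion", "label_smoothed_cross_entropy",
        "--label-smoothing", "0.1",
        "--max-tokens", "3584", "--fp16",
        "--distributed-world-size", "8",
        "--distributed-init-method", HOST_PORT,
        "--distributed-rank", str(rank),
    ]
    return subprocess.Popen(cmd, env=env)

# Start one training process per GPU and wait for all of them.
procs = [launch(rank) for rank in range(8)]
for p in procs:
    p.wait()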