
Error when training on multi-GPU

See original GitHub issue

When I train on multiple GPUs I get the error below, but when I train on a single GPU it does not appear:

    ValueError: gather got an input of invalid size: got 10x110x29, but expected 10x226x29

    python -m multiproc train.py --train-manifest qkids/manifest/qkids_train_manifest_limit_250.csv --val-manifest qkids/manifest/qkids_test_manifest_limit_never_train.csv --cuda --model-path models/libri_final_and_limit.pth --epochs 50 --checkpoint --checkpoint-per-batch 1000 --batch-size 20

    ['train.py', '--train-manifest', 'qkids/manifest/qkids_train_manifest_limit_250.csv', '--val-manifest', 'qkids/manifest/qkids_test_manifest_limit_never_train.csv', '--cuda', '--model-path', 'models/libri_final_and_limit.pth', '--epochs', '50', '--checkpoint', '--checkpoint-per-batch', '1000', '--batch-size', '20', '--world-size', '2', '--rank', '0', '--gpu-rank', '0']
    ['train.py', '--train-manifest', 'qkids/manifest/qkids_train_manifest_limit_250.csv', '--val-manifest', 'qkids/manifest/qkids_test_manifest_limit_never_train.csv', '--cuda', '--model-path', 'models/libri_final_and_limit.pth', '--epochs', '50', '--checkpoint', '--checkpoint-per-batch', '1000', '--batch-size', '20', '--world-size', '2', '--rank', '1', '--gpu-rank', '1']

    DistributedDataParallel(
      (module): DeepSpeech(
        (conv): MaskConv(
          (seq_module): Sequential(
            (0): Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5))
            (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (2): Hardtanh(min_val=0, max_val=20, inplace)
            (3): Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5))
            (4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (5): Hardtanh(min_val=0, max_val=20, inplace)
          )
        )
        (rnns): Sequential(
          (0): BatchRNN(
            (rnn): GRU(1312, 800, bidirectional=True)
          )
          (1): BatchRNN(
            (batch_norm): SequenceWise (BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
            (rnn): GRU(800, 800, bidirectional=True)
          )
          (2): BatchRNN(
            (batch_norm): SequenceWise (BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
            (rnn): GRU(800, 800, bidirectional=True)
          )
          (3): BatchRNN(
            (batch_norm): SequenceWise (BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
            (rnn): GRU(800, 800, bidirectional=True)
          )
          (4): BatchRNN(
            (batch_norm): SequenceWise (BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
            (rnn): GRU(800, 800, bidirectional=True)
          )
        )
        (fc): Sequential(
          (0): SequenceWise (Sequential(
            (0): BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (1): Linear(in_features=800, out_features=29, bias=False)
          ))
        )
        (inference_softmax): InferenceBatchSoftmax()
      )
    )
    Number of parameters: 41187968

    /home/luozhiping/workspace/speech/deepspeech.pytorch/model_new.py:98: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
      x, h = self.rnn(x)
    Traceback (most recent call last):
      File "train.py", line 248, in <module>
        out, output_sizes = model(inputs, input_sizes)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 217, in forward
        return self.gather(outputs, self.output_device)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 226, in gather
        return gather(outputs, output_device, dim=self.dim)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
        return gather_map(outputs)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
        return type(out)(map(gather_map, zip(*outputs)))
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
        return Gather.apply(target_device, dim, *outputs)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 55, in forward
        return comm.gather(inputs, ctx.dim, ctx.target_device)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 186, in gather
        "but expected {}".format(got, expected))
    ValueError: gather got an input of invalid size: got 10x110x29, but expected 10x226x29
    terminate called after throwing an instance of 'gloo::EnforceNotMet'
      what():  [enforce fail at /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda.cu:249] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda.cu:249: driver shutting down

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 3
  • Comments: 15

Top GitHub Comments

7 reactions
dmckinney5 commented, Nov 15, 2018

If you follow the steps from @slavaGanzin to get past the invalid-size error, you can get past the subsequent assertion error by putting output_lengths on the GPU in DeepSpeech's forward:

def forward(self, x, lengths):
    lengths = lengths.cpu().int()
    output_lengths = self.get_seq_lens(lengths)
+   output_lengths = output_lengths.cuda()

You will also need to ensure the output lengths returned (into output_sizes) are back on the CPU for the CTC loss in the training loop:

loss = criterion(out, targets, output_sizes.cpu(), target_sizes)

These steps allowed me to run on multiple GPUs, seemingly without issue.
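
For context, the two changes above amount to device bookkeeping. The sketch below is self-contained but uses invented module names, and torch.nn.CTCLoss stands in for the repo's CTC criterion: the model returns its output lengths on the model's device (mirroring the output_lengths.cuda() line), and the training step moves them back to the CPU before the loss (mirroring the output_sizes.cpu() line).

    # Sketch only: invented names; nn.CTCLoss standing in for the repo's criterion.
    import torch
    import torch.nn as nn

    class TinyAcousticModel(nn.Module):
        def __init__(self, n_feats=161, n_classes=29):
            super().__init__()
            self.rnn = nn.GRU(n_feats, 64)
            self.fc = nn.Linear(64, n_classes)

        def forward(self, x, lengths):
            # This toy model does no downsampling, so output lengths equal
            # input lengths; keep them on the model's device, as in the
            # output_lengths.cuda() change above.
            output_lengths = lengths.to(x.device)
            out, _ = self.rnn(x)                          # x: (T, N, F)
            return self.fc(out).log_softmax(-1), output_lengths

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = TinyAcousticModel().to(device)
    criterion = nn.CTCLoss(blank=0)

    inputs = torch.randn(226, 10, 161, device=device)    # T=226, N=10, F=161
    input_sizes = torch.full((10,), 226, dtype=torch.long)
    targets = torch.randint(1, 29, (10, 50))              # label 0 is the CTC blank
    target_sizes = torch.full((10,), 50, dtype=torch.long)

    out, output_sizes = model(inputs, input_sizes)
    # CTC wants the lengths back on the CPU, as in the training-loop line above.
    loss = criterion(out, targets, output_sizes.cpu(), target_sizes)
    loss.backward()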

1 reaction
slavaGanzin commented, Aug 14, 2018

You should use the total_length argument:

        total_length = x.size(0)  # padded length of the full batch, identical on every replica
        x = nn.utils.rnn.pack_padded_sequence(x, output_lengths)
        x, h = self.rnn(x)
        # total_length pads the output back to the full batch length, so that
        # DataParallel/DistributedDataParallel can gather equally sized tensors.
        x, _ = nn.utils.rnn.pad_packed_sequence(x, total_length=total_length)

https://pytorch.org/docs/stable/notes/faq.html#my-recurrent-network-doesn-t-work-with-data-parallelism
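
As a runnable illustration of that FAQ entry, here is a minimal sketch (module and variable names are invented, not taken from deepspeech.pytorch): without total_length, each DataParallel/DistributedDataParallel replica pads its output only to its own local maximum length, which is exactly how a gather of 10x110x29 against an expected 10x226x29 arises.

    # Minimal sketch, invented names: the total_length pattern from the FAQ.
    import torch
    import torch.nn as nn

    class PackedGRU(nn.Module):
        def __init__(self, input_size=161, hidden_size=64):
            super().__init__()
            self.rnn = nn.GRU(input_size, hidden_size)

        def forward(self, x, lengths):
            # x: (T, N, F) padded batch; lengths: valid length of each sample
            total_length = x.size(0)  # padded length of the full batch
            packed = nn.utils.rnn.pack_padded_sequence(x, lengths.cpu())
            out, _ = self.rnn(packed)
            # Pad back to total_length so every replica returns the same time dim.
            out, _ = nn.utils.rnn.pad_packed_sequence(out, total_length=total_length)
            return out

    model = PackedGRU()
    x = torch.randn(226, 10, 161)                           # T=226, N=10, F=161
    lengths, _ = torch.sort(torch.randint(50, 227, (10,)), descending=True)
    print(model(x, lengths).shape)                          # torch.Size([226, 10, 64])
    # On a multi-GPU machine the same module can then be wrapped, e.g.
    # model = nn.DataParallel(model.cuda())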

Read more comments on GitHub >

Top Results From Across the Web

Multi-GPU Training error #2461 - ultralytics/yolov5 - GitHub
Multi-GPU Training: python -m torch.distributed.launch --master_port 42342 ... I got the error: Tensors must be CUDA and dense When I set ...
Read more >
Problems with multi-gpus - MATLAB Answers - MathWorks
Learn more about multi gpus. ... no problem training with a single gpu, but when I try to train with multiple gpus, matlab...
Read more >
Training with multiple GPUs has error using TAO toolkit
I am using the following command to train maskrcnn. If I set --gpus 1 , it is fine. If I set 4, I...
Read more >
Error occurs when saving model in multi-gpu settings
Currently, I'm using accelerate library to do the training in multi-gpu settings. And the relevant code for saving the model is as follows:...
Read more >
Multi-GPU training crashes after some time due to NVLink ...
nvidia-smi lists the GPU as “GPU is lost”, syslog shows Xid error 74, which according to Nvidia documentation relates to fatal NVLink error...
Read more >
