
Error when training on multi-GPU

See original GitHub issue

When I train on multiple GPUs I get the error below, but when I train on a single GPU it does not appear:

    ValueError: gather got an input of invalid size: got 10x110x29, but expected 10x226x29

    python -m multiproc train.py --train-manifest qkids/manifest/qkids_train_manifest_limit_250.csv --val-manifest qkids/manifest/qkids_test_manifest_limit_never_train.csv --cuda --model-path models/libri_final_and_limit.pth --epochs 50 --checkpoint --checkpoint-per-batch 1000 --batch-size 20

    ['train.py', '--train-manifest', 'qkids/manifest/qkids_train_manifest_limit_250.csv', '--val-manifest', 'qkids/manifest/qkids_test_manifest_limit_never_train.csv', '--cuda', '--model-path', 'models/libri_final_and_limit.pth', '--epochs', '50', '--checkpoint', '--checkpoint-per-batch', '1000', '--batch-size', '20', '--world-size', '2', '--rank', '0', '--gpu-rank', '0']
    ['train.py', '--train-manifest', 'qkids/manifest/qkids_train_manifest_limit_250.csv', '--val-manifest', 'qkids/manifest/qkids_test_manifest_limit_never_train.csv', '--cuda', '--model-path', 'models/libri_final_and_limit.pth', '--epochs', '50', '--checkpoint', '--checkpoint-per-batch', '1000', '--batch-size', '20', '--world-size', '2', '--rank', '1', '--gpu-rank', '1']

    DistributedDataParallel(
      (module): DeepSpeech(
        (conv): MaskConv(
          (seq_module): Sequential(
            (0): Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5))
            (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (2): Hardtanh(min_val=0, max_val=20, inplace)
            (3): Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5))
            (4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (5): Hardtanh(min_val=0, max_val=20, inplace)
          )
        )
        (rnns): Sequential(
          (0): BatchRNN(
            (rnn): GRU(1312, 800, bidirectional=True)
          )
          (1): BatchRNN(
            (batch_norm): SequenceWise (BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
            (rnn): GRU(800, 800, bidirectional=True)
          )
          (2): BatchRNN(
            (batch_norm): SequenceWise (BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
            (rnn): GRU(800, 800, bidirectional=True)
          )
          (3): BatchRNN(
            (batch_norm): SequenceWise (BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
            (rnn): GRU(800, 800, bidirectional=True)
          )
          (4): BatchRNN(
            (batch_norm): SequenceWise (BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
            (rnn): GRU(800, 800, bidirectional=True)
          )
        )
        (fc): Sequential(
          (0): SequenceWise (Sequential(
            (0): BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            (1): Linear(in_features=800, out_features=29, bias=False)
          ))
        )
        (inference_softmax): InferenceBatchSoftmax()
      )
    )
    Number of parameters: 41187968

    /home/luozhiping/workspace/speech/deepspeech.pytorch/model_new.py:98: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
      x, h = self.rnn(x)
    Traceback (most recent call last):
      File "train.py", line 248, in <module>
        out, output_sizes = model(inputs, input_sizes)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 217, in forward
        return self.gather(outputs, self.output_device)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 226, in gather
        return gather(outputs, output_device, dim=self.dim)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
        return gather_map(outputs)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
        return type(out)(map(gather_map, zip(*outputs)))
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
        return Gather.apply(target_device, dim, *outputs)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 55, in forward
        return comm.gather(inputs, ctx.dim, ctx.target_device)
      File "/home/luozhiping/anaconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 186, in gather
        "but expected {}".format(got, expected))
    ValueError: gather got an input of invalid size: got 10x110x29, but expected 10x226x29
    terminate called after throwing an instance of 'gloo::EnforceNotMet'
      what():  [enforce fail at /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda.cu:249] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda.cu:249: driver shutting down

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 3
  • Comments: 15

Top GitHub Comments

7 reactions
dmckinney5 commented, Nov 15, 2018

If you follow the steps from @slavaGanzin to get past the invalid-size error, you can get past the subsequent assertion error by putting output_lengths on the GPU in DeepSpeech's forward:

def forward(self, x, lengths):
    lengths = lengths.cpu().int()
    output_lengths = self.get_seq_lens(lengths)
+   output_lengths = output_lengths.cuda()

You will also need to ensure the output lengths returned (into output_sizes) are back on the CPU for the CTC loss in the training loop:

loss = criterion(out, targets, output_sizes.cpu(), target_sizes)

These steps allowed me to run on multiple GPUs, seemingly without issue.
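
For context, the two changes above amount to device bookkeeping. The sketch below is self-contained but uses invented module names, and torch.nn.CTCLoss stands in for the repo's CTC criterion: the model returns its output lengths on the model's device (mirroring the output_lengths.cuda() line), and the training step moves them back to the CPU before the loss (mirroring the output_sizes.cpu() line).

    # Sketch only: invented names; nn.CTCLoss standing in for the repo's criterion.
    import torch
    import torch.nn as nn

    class TinyAcousticModel(nn.Module):
        def __init__(self, n_feats=161, n_classes=29):
            super().__init__()
            self.rnn = nn.GRU(n_feats, 64)
            self.fc = nn.Linear(64, n_classes)

        def forward(self, x, lengths):
            # This toy model does no downsampling, so output lengths equal
            # input lengths; keep them on the model's device, as in the
            # output_lengths.cuda() change above.
            output_lengths = lengths.to(x.device)
            out, _ = self.rnn(x)                          # x: (T, N, F)
            return self.fc(out).log_softmax(-1), output_lengths

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = TinyAcousticModel().to(device)
    criterion = nn.CTCLoss(blank=0)

    inputs = torch.randn(226, 10, 161, device=device)    # T=226, N=10, F=161
    input_sizes = torch.full((10,), 226, dtype=torch.long)
    targets = torch.randint(1, 29, (10, 50))              # label 0 is the CTC blank
    target_sizes = torch.full((10,), 50, dtype=torch.long)

    out, output_sizes = model(inputs, input_sizes)
    # CTC wants the lengths back on the CPU, as in the training-loop line above.
    loss = criterion(out, targets, output_sizes.cpu(), target_sizes)
    loss.backward()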

1 reaction
slavaGanzin commented, Aug 14, 2018

You should use the total_length argument:

        total_length = x.size(0)  # padded length of the full batch, identical on every replica
        x = nn.utils.rnn.pack_padded_sequence(x, output_lengths)
        x, h = self.rnn(x)
        # total_length pads the output back to the full batch length, so that
        # DataParallel/DistributedDataParallel can gather equally sized tensors.
        x, _ = nn.utils.rnn.pad_packed_sequence(x, total_length=total_length)

https://pytorch.org/docs/stable/notes/faq.html#my-recurrent-network-doesn-t-work-with-data-parallelism
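
As a runnable illustration of that FAQ entry, here is a minimal sketch (module and variable names are invented, not taken from deepspeech.pytorch): without total_length, each DataParallel/DistributedDataParallel replica pads its output only to its own local maximum length, which is exactly how a gather of 10x110x29 against an expected 10x226x29 arises.

    # Minimal sketch, invented names: the total_length pattern from the FAQ.
    import torch
    import torch.nn as nn

    class PackedGRU(nn.Module):
        def __init__(self, input_size=161, hidden_size=64):
            super().__init__()
            self.rnn = nn.GRU(input_size, hidden_size)

        def forward(self, x, lengths):
            # x: (T, N, F) padded batch; lengths: valid length of each sample
            total_length = x.size(0)  # padded length of the full batch
            packed = nn.utils.rnn.pack_padded_sequence(x, lengths.cpu())
            out, _ = self.rnn(packed)
            # Pad back to total_length so every replica returns the same time dim.
            out, _ = nn.utils.rnn.pad_packed_sequence(out, total_length=total_length)
            return out

    model = PackedGRU()
    x = torch.randn(226, 10, 161)                           # T=226, N=10, F=161
    lengths, _ = torch.sort(torch.randint(50, 227, (10,)), descending=True)
    print(model(x, lengths).shape)                          # torch.Size([226, 10, 64])
    # On a multi-GPU machine the same module can then be wrapped, e.g.
    # model = nn.DataParallel(model.cuda())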

Read more comments on GitHub >

Top Results From Across the Web

Multi-GPU Training error #2461 - ultralytics/yolov5 - GitHub
Multi-GPU Training: python -m torch.distributed.launch --master_port 42342 ... I got the error: Tensors must be CUDA and dense When I set ...
Read more >
Problems with multi-gpus - MATLAB Answers - MathWorks
Learn more about multi gpus. ... no problem training with a single gpu, but when I try to train with multiple gpus, matlab...
Read more >
Training with multiple GPUs has error using TAO toolkit
I am using the following command to train maskrcnn. If I set --gpus 1 , it is fine. If I set 4, I...
Read more >
Error occurs when saving model in multi-gpu settings
Currently, I'm using accelerate library to do the training in multi-gpu settings. And the relevant code for saving the model is as follows:...
Read more >
Multi-GPU training crashes after some time due to NVLink ...
nvidia-smi lists the GPU as “GPU is lost”, syslog shows Xid error 74, which according to Nvidia documentation relates to fatal NVLink error...
Read more >
