
training with single GPU


Hi,

So while trying to train the network I encountered this error. I can’t figure out what the mistake is. I’m using the proper pytorch commit. I have not made any modifications to the code.

From my terminal:

    CUDA_VISIBLE_DEVICES=0 python train.py --dataset cityscapes --model danet --backbone resnet101 --checkname danet101 --base-size 1024 --crop-size 768 --epochs 240 --batch-size 8 --lr 0.003 --workers 2 --multi-grid --multi-dilation 4 8 16

Error:

Traceback (most recent call last):
  File "train.py", line 201, in <module>
    trainer.training(epoch)
  File "train.py", line 125, in training
    outputs = self.model(image)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/encoding/models/danet.py", line 45, in forward
    _, _, c3, c4 = self.base_forward(x)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/encoding/models/base.py", line 58, in base_forward
    x = self.pretrained.bn1(x)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/encoding/nn/syncbn.py", line 57, in forward
    mean, inv_std = self._slave_pipe.run_slave(_ChildMessage(xsum, xsqsum, N))
AttributeError: 'NoneType' object has no attribute 'run_slave'
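
A likely reading of the traceback (my interpretation, not something stated by the repo): with CUDA_VISIBLE_DEVICES=0 only one device is visible, so DataParallel takes the single-device shortcut visible in the trace (data_parallel.py line 121, return self.module(*inputs[0], **kwargs[0])) and never replicates the model across GPUs. The synchronized BatchNorm layer's replication callback therefore never runs, its _slave_pipe stays None, and the forward pass fails when it tries to call run_slave on it. A quick check that only one device is visible:

    # Sanity check only: prints how many CUDA devices PyTorch can see.
    # With CUDA_VISIBLE_DEVICES=0 on a CUDA machine this should print 1,
    # which is the case where DataParallel skips replication entirely.
    import torch
    print(torch.cuda.device_count())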

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 19

Top GitHub Comments

4 reactions
yougoforward commented, Mar 21, 2019

Hello, everyone! I have met this problem and solved it as follows: in DANet/encoding/nn/syncbn.py, line 36 sets self._parallel_id = None. You need to set self._parallel_id = 0 if you only have one GPU. You can see the data_parallel_replicate function at line 61:

    def data_parallel_replicate(self, ctx, copy_id):
        self._parallel_id = copy_id

        # parallel_id == 0 means master device.
        if self._parallel_id == 0:
            ctx.sync_master = self._sync_master
        else:
            self._slave_pipe = ctx.sync_master.register_slave(copy_id)
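
To make the mechanics concrete, here is a minimal, self-contained sketch. It only illustrates the branch on _parallel_id; it is not the code in encoding/nn/syncbn.py, and only the attribute names _parallel_id and _slave_pipe come from the comment and the traceback above. With one GPU the replication callback never runs, so unless _parallel_id is 0 the layer falls into the slave branch and dereferences a pipe that was never created.

    # Hypothetical stand-in for the SyncBN layer, for illustration only.
    class TinySyncBNLike:
        def __init__(self, single_gpu=False):
            # The repo sets this to None; the suggested fix is 0 on a single
            # GPU so the layer treats itself as the master device.
            self._parallel_id = 0 if single_gpu else None
            # Only set by data_parallel_replicate on replica copies.
            self._slave_pipe = None

        def forward_stats(self, xsum, xsqsum, n):
            if self._parallel_id == 0:
                # Master device: reduce the statistics locally.
                return xsum / n, xsqsum / n
            # Replica: needs the pipe registered during replication; with a
            # single GPU this is still None, hence the AttributeError above.
            return self._slave_pipe.run_slave((xsum, xsqsum, n))

    print(TinySyncBNLike(single_gpu=True).forward_stats(4.0, 10.0, 4))   # works
    # TinySyncBNLike(single_gpu=False).forward_stats(4.0, 10.0, 4)       # would crash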

3 reactions
yougoforward commented, Apr 23, 2019

For eval_batch in train.py, comment out the multi-GPU data collection line as follows:

    def eval_batch(model, image, target):
        outputs = model(image)
        # outputs = gather(outputs, 0, dim=0)
        pred = outputs[0]
        target = target.cuda()
        correct, labeled = utils.batch_pix_accuracy(pred.data, target)
        inter, union = utils.batch_intersection_union(pred.data, target, self.nclass)
        return correct, labeled, inter, union
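
If you would rather keep multi-GPU support, a guarded variant of the same edit is also possible (a sketch under assumptions, not code from the repo): only gather when more than one device is visible. Here gather is assumed to be torch.nn.parallel.scatter_gather.gather, and utils and self.nclass are assumed to be in the same enclosing scope of train.py as in the snippet above.

    # Hypothetical variant of the snippet above; assumes the same enclosing
    # scope in train.py (utils, self.nclass) and that `gather` comes from
    # torch.nn.parallel.scatter_gather.
    import torch
    from torch.nn.parallel.scatter_gather import gather

    def eval_batch(model, image, target):
        outputs = model(image)
        if torch.cuda.device_count() > 1:
            # Multi-GPU: collect the per-device output lists onto device 0.
            outputs = gather(outputs, 0, dim=0)
        pred = outputs[0]
        target = target.cuda()
        correct, labeled = utils.batch_pix_accuracy(pred.data, target)
        inter, union = utils.batch_intersection_union(pred.data, target, self.nclass)
        return correct, labeled, inter, union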
