
training with single GPU


Hi,

So while trying to train the network I encountered this error. I can’t figure out what the mistake is. I’m using the proper pytorch commit. I have not made any modifications to the code.

From my terminal:

    CUDA_VISIBLE_DEVICES=0 python train.py --dataset cityscapes --model danet --backbone resnet101 --checkname danet101 --base-size 1024 --crop-size 768 --epochs 240 --batch-size 8 --lr 0.003 --workers 2 --multi-grid --multi-dilation 4 8 16

Error:

Traceback (most recent call last):
  File "train.py", line 201, in <module>
    trainer.training(epoch)
  File "train.py", line 125, in training
    outputs = self.model(image)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/encoding/models/danet.py", line 45, in forward
    _, _, c3, c4 = self.base_forward(x)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/encoding/models/base.py", line 58, in base_forward
    x = self.pretrained.bn1(x)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/encoding/nn/syncbn.py", line 57, in forward
    mean, inv_std = self._slave_pipe.run_slave(_ChildMessage(xsum, xsqsum, N))
AttributeError: 'NoneType' object has no attribute 'run_slave'
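
A likely reading of the traceback (my interpretation, not something stated by the repo): with CUDA_VISIBLE_DEVICES=0 only one device is visible, so DataParallel takes the single-device shortcut visible in the trace (data_parallel.py line 121, return self.module(*inputs[0], **kwargs[0])) and never replicates the model across GPUs. The synchronized BatchNorm layer's replication callback therefore never runs, its _slave_pipe stays None, and the forward pass fails when it tries to call run_slave on it. A quick check that only one device is visible:

    # Sanity check only: prints how many CUDA devices PyTorch can see.
    # With CUDA_VISIBLE_DEVICES=0 on a CUDA machine this should print 1,
    # which is the case where DataParallel skips replication entirely.
    import torch
    print(torch.cuda.device_count())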

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Comments: 19

Top GitHub Comments

4 reactions
yougoforward commented, Mar 21, 2019

Hello, everyone! I have met this problem and solved it as follows: in DANet/encoding/nn/syncbn.py, line 36 sets self._parallel_id = None. You need to set self._parallel_id = 0 if you only have one GPU. You can see the data_parallel_replicate function at line 61:

    def data_parallel_replicate(self, ctx, copy_id):
        self._parallel_id = copy_id

        # parallel_id == 0 means master device.
        if self._parallel_id == 0:
            ctx.sync_master = self._sync_master
        else:
            self._slave_pipe = ctx.sync_master.register_slave(copy_id)
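
To make the mechanics concrete, here is a minimal, self-contained sketch. It only illustrates the branch on _parallel_id; it is not the code in encoding/nn/syncbn.py, and only the attribute names _parallel_id and _slave_pipe come from the comment and the traceback above. With one GPU the replication callback never runs, so unless _parallel_id is 0 the layer falls into the slave branch and dereferences a pipe that was never created.

    # Hypothetical stand-in for the SyncBN layer, for illustration only.
    class TinySyncBNLike:
        def __init__(self, single_gpu=False):
            # The repo sets this to None; the suggested fix is 0 on a single
            # GPU so the layer treats itself as the master device.
            self._parallel_id = 0 if single_gpu else None
            # Only set by data_parallel_replicate on replica copies.
            self._slave_pipe = None

        def forward_stats(self, xsum, xsqsum, n):
            if self._parallel_id == 0:
                # Master device: reduce the statistics locally.
                return xsum / n, xsqsum / n
            # Replica: needs the pipe registered during replication; with a
            # single GPU this is still None, hence the AttributeError above.
            return self._slave_pipe.run_slave((xsum, xsqsum, n))

    print(TinySyncBNLike(single_gpu=True).forward_stats(4.0, 10.0, 4))   # works
    # TinySyncBNLike(single_gpu=False).forward_stats(4.0, 10.0, 4)       # would crash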

3 reactions
yougoforward commented, Apr 23, 2019

For eval_batch in train.py, comment out the multi-GPU data collection line as follows:

    def eval_batch(model, image, target):
        outputs = model(image)
        # outputs = gather(outputs, 0, dim=0)
        pred = outputs[0]
        target = target.cuda()
        correct, labeled = utils.batch_pix_accuracy(pred.data, target)
        inter, union = utils.batch_intersection_union(pred.data, target, self.nclass)
        return correct, labeled, inter, union
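
If you would rather keep multi-GPU support, a guarded variant of the same edit is also possible (a sketch under assumptions, not code from the repo): only gather when more than one device is visible. Here gather is assumed to be torch.nn.parallel.scatter_gather.gather, and utils and self.nclass are assumed to be in the same enclosing scope of train.py as in the snippet above.

    # Hypothetical variant of the snippet above; assumes the same enclosing
    # scope in train.py (utils, self.nclass) and that `gather` comes from
    # torch.nn.parallel.scatter_gather.
    import torch
    from torch.nn.parallel.scatter_gather import gather

    def eval_batch(model, image, target):
        outputs = model(image)
        if torch.cuda.device_count() > 1:
            # Multi-GPU: collect the per-device output lists onto device 0.
            outputs = gather(outputs, 0, dim=0)
        pred = outputs[0]
        target = target.cuda()
        correct, labeled = utils.batch_pix_accuracy(pred.data, target)
        inter, union = utils.batch_intersection_union(pred.data, target, self.nclass)
        return correct, labeled, inter, union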
