Training with a single GPU
Hi,
While trying to train the network I ran into the error below, and I can't figure out what the mistake is. I'm using the recommended PyTorch commit and have not made any modifications to the code.
From my terminal:

```
CUDA_VISIBLE_DEVICES=0 python train.py --dataset cityscapes --model danet --backbone resnet101 --checkname danet101 --base-size 1024 --crop-size 768 --epochs 240 --batch-size 8 --lr 0.003 --workers 2 --multi-grid --multi-dilation 4 8 16
```
Error:

```
Traceback (most recent call last):
  File "train.py", line 201, in <module>
    trainer.training(epoch)
  File "train.py", line 125, in training
    outputs = self.model(image)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/encoding/models/danet.py", line 45, in forward
    _, _, c3, c4 = self.base_forward(x)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/encoding/models/base.py", line 58, in base_forward
    x = self.pretrained.bn1(x)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/datadrive/virtualenvs/torchDA/lib/python3.6/site-packages/encoding/nn/syncbn.py", line 57, in forward
    mean, inv_std = self._slave_pipe.run_slave(_ChildMessage(xsum, xsqsum, N))
AttributeError: 'NoneType' object has no attribute 'run_slave'
```
Top GitHub Comments
Hello, everyone! I have met this problem and solved it as follows: in DANet/encoding/nn/syncbn.py, line 36 reads

```python
self._parallel_id = None
```

You need to set `self._parallel_id = 0` if you have just one GPU. You can see the `data_parallel_replicate` function at line 61, which is what normally assigns it:

```python
def data_parallel_replicate(self, ctx, copy_id):
    self._parallel_id = copy_id
```
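Why this works, as a sketch reconstructed from the traceback and the comment above (names other than `run_slave` and `_ChildMessage` are assumptions; the exact code in syncbn.py may differ): `forward` dispatches on `_parallel_id`, which is only assigned when DataParallel replicates the module across several GPUs. With one GPU the replicate hook never fires, so `_parallel_id` stays `None` and the slave branch dereferences the unset `_slave_pipe`:

```python
# Sketch of the dispatch inside SyncBatchNorm.forward (encoding/nn/syncbn.py),
# reconstructed from the traceback; run_master/_sync_master are assumptions.
if self._parallel_id == 0:
    # master replica: aggregates batch statistics from all copies
    mean, inv_std = self._sync_master.run_master(_ChildMessage(xsum, xsqsum, N))
else:
    # slave replica: sends its statistics to the master. With a single GPU,
    # _slave_pipe was never set up, hence the AttributeError on None.
    mean, inv_std = self._slave_pipe.run_slave(_ChildMessage(xsum, xsqsum, N))
```

Setting `self._parallel_id = 0` forces the master branch, which needs no slave pipe.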
For eval_batch in train.py, comment out the multi-GPU data collection line.
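The snippet that originally followed that comment isn't shown above. As an illustration only (a hypothetical reconstruction; the actual `eval_batch` in train.py may have a different body and signature), the change would look something like:

```python
from torch.nn.parallel.scatter_gather import gather
from encoding.utils import batch_pix_accuracy, batch_intersection_union

def eval_batch(model, image, target, nclass):
    # Hypothetical sketch of the validation step; names and structure are
    # assumptions, not the repository's exact code.
    outputs = model(image)
    # outputs = gather(outputs, 0, dim=0)  # multi-GPU data collection:
    #                                      # commented out for a single GPU
    pred = outputs[0]
    correct, labeled = batch_pix_accuracy(pred.data, target)
    inter, union = batch_intersection_union(pred.data, target, nclass)
    return correct, labeled, inter, union
```

With a single GPU, `model(image)` already returns the outputs of the one replica, so there is nothing to gather across devices.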