Error about FP16 Training: FP16 mode cannot be used in single-GPU training.
See original GitHub issue
Checklist
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug
I want to use FP16 mode for single-GPU training on GPU #1 of my machine (it has two GPUs), but I run into the errors below.
My config:
norm_cfg = dict(type="BN", requires_grad=True)
...
# optimizer
optimizer = dict(type="SGD", lr=0.01, momentum=0.9, weight_decay=0.0005)
# fp16 settings
optimizer_config = dict(type="Fp16OptimizerHook", loss_scale=512.0)
# learning policy
lr_config = dict(policy="poly", power=0.9, min_lr=1e-4, by_epoch=False)
# runtime settings
total_iters = 237500
checkpoint_config = dict(interval=5937, by_epoch=False)
evaluation = dict(interval=5937, metric="mIoU")
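For context, fp16 training here is enabled by swapping the default optimizer hook for Fp16OptimizerHook. A minimal sketch of the two variants; the FP32 default line is an assumption based on the standard mmseg schedule configs, not part of this issue's config:
# default FP32 optimizer hook (standard mmseg schedules)
# optimizer_config = dict()
# mixed-precision hook with static loss scaling, as in the config above
optimizer_config = dict(type="Fp16OptimizerHook", loss_scale=512.0)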
My commands:
CUDA_VISIBLE_DEVICES=1 python tools/train.py configs/remotesense/deeplabv3plus_r101-d8_256x256_fp16_e40_remotesense.py
# out
2020-09-09 10:40:09,478 - mmseg - INFO - workflow: [('train', 1)], max: 237500 iters
Traceback (most recent call last):
File "tools/train.py", line 161, in <module>
main()
File "tools/train.py", line 157, in main
meta=meta)
File "/home/lart/Coding/mmLib/segForRSIS/mmseg/apis/train.py", line 106, in train_segmentor
runner.run(data_loaders, cfg.workflow, cfg.total_iters)
File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 119, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
self.call_hook('after_train_iter')
File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 298, in call_hook
getattr(hook, fn_name)(self)
File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 110, in after_train_iter
allreduce_grads(fp32_weights, self.coalesce, self.bucket_size_mb)
File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 236, in allreduce_grads
world_size = dist.get_world_size()
File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 620, in get_world_size
return _get_group_size(group)
File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 219, in _get_group_size
_check_default_pg()
File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
"Default process group is not initialized"
AssertionError: Default process group is not initialized
CUDA_VISIBLE_DEVICES=1 python tools/train.py configs/remotesense/deeplabv3plus_r101-d8_256x256_fp16_e40_remotesense.py --gpu-ids 0
# out
2020-09-09 10:40:28,672 - mmseg - INFO - workflow: [('train', 1)], max: 237500 iters
Traceback (most recent call last):
... (identical to the traceback for the first command)
AssertionError: Default process group is not initialized
CUDA_VISIBLE_DEVICES=1 python tools/train.py configs/remotesense/deeplabv3plus_r101-d8_256x256_fp16_e40_remotesense.py --gpus 1
# out
2020-09-09 10:41:09,209 - mmseg - INFO - workflow: [('train', 1)], max: 237500 iters
Traceback (most recent call last):
... (identical to the traceback for the first command)
AssertionError: Default process group is not initialized
python tools/train.py configs/remotesense/deeplabv3plus_r101-d8_256x256_fp16_e40_remotesense.py --gpu-ids 1
# out
2020-09-09 10:43:53,705 - mmseg - INFO - workflow: [('train', 1)], max: 237500 iters
Traceback (most recent call last):
... (identical to the traceback for the first command)
AssertionError: Default process group is not initialized
python tools/train.py configs/remotesense/deeplabv3plus_r101-d8_256x256_fp16_e40_remotesense.py --gpu-ids 0
# out
2020-09-09 10:46:45,033 - mmseg - INFO - workflow: [('train', 1)], max: 237500 iters
Traceback (most recent call last):
... (identical to the traceback for the first command)
AssertionError: Default process group is not initialized
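Every command above fails at the same point: Fp16OptimizerHook.after_train_iter() calls allreduce_grads(), which queries torch.distributed for the world size even though no process group was ever initialized. A minimal check (hypothetical snippet, not from the issue) that shows the condition:
import torch.distributed as dist
# In a plain `python tools/train.py ...` run no process group exists,
# so is_initialized() is False and dist.get_world_size() raises the
# "Default process group is not initialized" assertion seen above.
print(dist.is_available(), dist.is_initialized())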
However, if I use the following command, it works:
CUDA_VISIBLE_DEVICES=1 ./tools/dist_train.sh configs/remotesense/deeplabv3plus_r101-d8_256x256_fp16_e40_remotesense.py 1
In addition, in this case (single-GPU training with dist_train.sh on GPU #1), do I need to use SyncBN?
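For reference, the norm layer is selected via norm_cfg in the config; a sketch of the two common variants (same keys as the config above). With only one training process there is nothing to synchronize, so SyncBN is not required; it matters once training spans multiple GPUs:
# single-GPU / non-distributed training: plain BN is sufficient
norm_cfg = dict(type="BN", requires_grad=True)
# multi-GPU distributed training: SyncBN keeps batch statistics consistent across GPUs
# norm_cfg = dict(type="SyncBN", requires_grad=True)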
Environment
- Please run python mmseg/utils/collect_env.py to collect the necessary environment information and paste it here.
➜ python mmseg/utils/collect_env.py
sys.platform: linux
Python: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GPU 0,1: GeForce RTX 2080 Ti
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.6.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.2
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.5
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.7.0
OpenCV: 4.4.0
MMCV: 1.1.2
MMSegmentation: 0.5.1+ff98229
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
@lartpang @xvjiarui Hi, this is not a bug in mmcv. You should explicitly set distributed to False; it defaults to True.
optimizer_config = dict(type="Fp16OptimizerHook", loss_scale=512.0, distributed=False)
@ChaofWang Maybe you are right…
https://github.com/open-mmlab/mmcv/blob/49fdf3cfa082bf2a5760bbb076f836224f4bec2c/mmcv/runner/hooks/optimizer.py#L35-L61
Unfortunately, this is not mentioned in the documentation.
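Putting the suggestion together, a sketch of the corrected fp16 settings for a non-distributed single-GPU run (same keys as the config above):
# fp16 settings for non-distributed (single-GPU) training
optimizer_config = dict(
    type="Fp16OptimizerHook",
    loss_scale=512.0,
    distributed=False,  # skip the allreduce_grads() call that requires an initialized process group
)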