Error about FP16 Training: FP16 mode cannot be used in single GPU training.

See original GitHub issue

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

I want to use FP16 mode for single-GPU training on GPU #1 (my computer has two GPUs), but I ran into the following errors.

My config:

norm_cfg = dict(type="BN", requires_grad=True)
...
# optimizer
optimizer = dict(type="SGD", lr=0.01, momentum=0.9, weight_decay=0.0005)
# fp16 settings
optimizer_config = dict(type="Fp16OptimizerHook", loss_scale=512.0)

# learning policy
lr_config = dict(policy="poly", power=0.9, min_lr=1e-4, by_epoch=False)
# runtime settings
total_iters = 237500

checkpoint_config = dict(interval=5937, by_epoch=False)
evaluation = dict(interval=5937, metric="mIoU")

My commands:

CUDA_VISIBLE_DEVICES=1 python tools/train.py configs/remotesense/deeplabv3plus_r101-d8_256x256_fp16_e40_remotesense.py

# out
2020-09-09 10:40:09,478 - mmseg - INFO - workflow: [('train', 1)], max: 237500 iters
Traceback (most recent call last):
  File "tools/train.py", line 161, in <module>
    main()
  File "tools/train.py", line 157, in main
    meta=meta)
  File "/home/lart/Coding/mmLib/segForRSIS/mmseg/apis/train.py", line 106, in train_segmentor
    runner.run(data_loaders, cfg.workflow, cfg.total_iters)
  File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 119, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
    self.call_hook('after_train_iter')
  File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 298, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 110, in after_train_iter
    allreduce_grads(fp32_weights, self.coalesce, self.bucket_size_mb)
  File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 236, in allreduce_grads
    world_size = dist.get_world_size()
  File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 620, in get_world_size
    return _get_group_size(group)
  File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 219, in _get_group_size
    _check_default_pg()
  File "/home/lart/miniconda3/envs/mmseg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
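
The assertion itself comes from torch.distributed: mmcv's allreduce_grads() calls dist.get_world_size(), which requires an initialized default process group, and a plain `python tools/train.py` run never creates one. A minimal sketch (not from the issue) that reproduces the same failure outside mmcv:

import torch.distributed as dist

# No call to dist.init_process_group() here, on purpose.
try:
    dist.get_world_size()  # this is what mmcv's allreduce_grads() calls internally
except Exception as err:
    # PyTorch 1.6 raises AssertionError("Default process group is not initialized");
    # newer releases raise a different exception type with a similar message.
    print(type(err).__name__, ":", err)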

CUDA_VISIBLE_DEVICES=1 python tools/train.py configs/remotesense/deeplabv3plus_r101-d8_256x256_fp16_e40_remotesense.py --gpu-ids 0

# out
2020-09-09 10:40:28,672 - mmseg - INFO - workflow: [('train', 1)], max: 237500 iters
(same traceback as above, ending in AssertionError: Default process group is not initialized)

CUDA_VISIBLE_DEVICES=1 python tools/train.py configs/remotesense/deeplabv3plus_r101-d8_256x256_fp16_e40_remotesense.py --gpus 1

# out
2020-09-09 10:41:09,209 - mmseg - INFO - workflow: [('train', 1)], max: 237500 iters
(same traceback as above, ending in AssertionError: Default process group is not initialized)

python tools/train.py configs/remotesense/deeplabv3plus_r101-d8_256x256_fp16_e40_remotesense.py --gpu-ids 1
# out
2020-09-09 10:43:53,705 - mmseg - INFO - workflow: [('train', 1)], max: 237500 iters
(same traceback as above, ending in AssertionError: Default process group is not initialized)

python tools/train.py configs/remotesense/deeplabv3plus_r101-d8_256x256_fp16_e40_remotesense.py --gpu-ids 0
# out
2020-09-09 10:46:45,033 - mmseg - INFO - workflow: [('train', 1)], max: 237500 iters
(same traceback as above, ending in AssertionError: Default process group is not initialized)

However, if I use the following command, it works:

CUDA_VISIBLE_DEVICES=1 ./tools/dist_train.sh configs/remotesense/deeplabv3plus_r101-d8_256x256_fp16_e40_remotesense.py 1

In addition, I would like to know: in this case (single-GPU training launched with dist_train.sh on GPU #1), do I need to use SyncBN?
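
As a general note (not an authoritative answer from the thread): SyncBN only differs from plain BN when batch statistics are synchronized across several processes, so with a single process/GPU it behaves like ordinary BN and plain BN is sufficient. The switch is the norm_cfg line already shown in the config above, roughly:

# single-process / single-GPU training: plain BN is sufficient
norm_cfg = dict(type="BN", requires_grad=True)
# multi-process distributed training: synchronize BN statistics across GPUs
# norm_cfg = dict(type="SyncBN", requires_grad=True)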

Environment

  1. Please run python mmseg/utils/collect_env.py to collect the necessary environment information and paste it here.
➜ python mmseg/utils/collect_env.py
sys.platform: linux
Python: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GPU 0,1: GeForce RTX 2080 Ti
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.6.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.7.0
OpenCV: 4.4.0
MMCV: 1.1.2
MMSegmentation: 0.5.1+ff98229
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

2 reactions
ChaofWang commented, Sep 14, 2020

Got it. I noticed that the error is raised in mmcv FP16Optimizer. We may support it in the future.

For now, you may use distributed training instead.

@lartpang @xvjiarui Hi, this is not a bug in mmcv; you should explicitly set distributed to False (it defaults to True).

optimizer_config = dict(type="Fp16OptimizerHook", loss_scale=512.0, distributed=False)
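
Applied to the config from the report, the fp16-related settings would look roughly like this (a sketch; only the distributed=False flag is new relative to the original config):

# optimizer
optimizer = dict(type="SGD", lr=0.01, momentum=0.9, weight_decay=0.0005)
# fp16 settings for non-distributed runs (plain `python tools/train.py`):
# distributed defaults to True, which makes the hook call allreduce_grads()
# and trips the "Default process group is not initialized" assertion.
optimizer_config = dict(type="Fp16OptimizerHook", loss_scale=512.0, distributed=False)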
