Facing NCCL error on multi-GPU training (single machine) using the run_glue.py script
Environment info
- transformers version: 4.3.2
- Platform: Linux-4.19.0-14-cloud-amd64-x86_64-with-debian-buster-sid
- Python version: 3.7.9
- PyTorch version (GPU?): 1.7.0 (True)
- Tensorflow version (GPU?): 2.4.1 (True)
- Using GPU in script?: Yes, 4x Tesla T4 (GCP)
- Using distributed or parallel set-up in script?: torch.distributed.launch
Who can help
Information
Model I am using (Bert, XLNet …): DistilRoberta
The problem arises when using:
- [*] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)
The task I am working on is:
- [ ] an official GLUE/SQuAD task: (give the name)
- [*] my own task or dataset: (give details below)
Regression task with a single output, using BertForSequenceClassification
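For reference, a single-output regression head can be obtained simply by loading the sequence classification model with num_labels=1. The sketch below is illustrative only (it is not the exact code path of run_glue.py, which infers num_labels from the dataset); the model name matches the one used above, everything else is an example:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

# Illustrative sketch: single-output (regression-style) head on distilroberta-base.
config = AutoConfig.from_pretrained("distilroberta-base", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", config=config)

inputs = tokenizer("example sentence", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 1]) -> one continuous output per example
```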
To reproduce
Steps to reproduce the behavior:
1. python -m torch.distributed.launch --nproc_per_node 4 /home/run_glue.py --train_file /home/data/train.csv --validation_file /home/data/dev.csv --test_file /home/data/test.csv --model_name_or_path distilroberta-base --output_dir /home/model --num_train_epochs 5 --per_device_train_batch_size 1 --per_device_eval_batch_size 16 --do_train --do_eval --fp16 --gradient_accumulation_steps 2 --do_predict --logging_steps 100 --evaluation_strategy steps --save_steps 100 --overwrite_output_dir
This produces the following NCCL warnings and traceback:

732793de051f:1897:1927 [3] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
732793de051f:1897:1927 [3] NCCL INFO include/shm.h:41 -> 2
732793de051f:1897:1927 [3] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-b3d54cebe4167a34-0-2-3 (size 9637888)
732793de051f:1895:1925 [1] NCCL INFO transport/shm.cc:101 -> 2
732793de051f:1895:1925 [1] NCCL INFO transport.cc:30 -> 2
732793de051f:1895:1925 [1] NCCL INFO transport.cc:49 -> 2
732793de051f:1895:1925 [1] NCCL INFO init.cc:766 -> 2
732793de051f:1895:1925 [1] NCCL INFO init.cc:840 -> 2
732793de051f:1895:1925 [1] NCCL INFO group.cc:73 -> 2 [Async thread]

Traceback (most recent call last):
  File "/home/run_text_classification.py", line 480, in <module>
    main()
  File "/home/run_text_classification.py", line 163, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/opt/conda/lib/python3.7/site-packages/transformers/hf_argparser.py", line 180, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 60, in __init__
  File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 478, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and self.fp16:
  File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 1346, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 583, in device
    return self._setup_devices
  File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 1336, in __get__
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729138878/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
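The posix_fallocate / shared-memory warnings typically point at /dev/shm being too small for NCCL's shared-memory transport, which is common inside Docker containers started with the default --shm-size. A quick check from Python (the interpretation in the comments is a suggestion, not an official requirement):

```python
import os

# Inspect the shared-memory filesystem NCCL uses for intra-node transport.
stats = os.statvfs("/dev/shm")
total_gib = stats.f_blocks * stats.f_frsize / 1024**3
free_gib = stats.f_bavail * stats.f_frsize / 1024**3
print(f"/dev/shm total: {total_gib:.2f} GiB, free: {free_gib:.2f} GiB")

# If this is tiny (e.g. Docker's 64 MiB default), enlarge it when starting the
# container (docker run --shm-size=1g ...) or, as a slower fallback, disable the
# shared-memory transport by setting the NCCL_SHM_DISABLE=1 environment variable.
```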
Expected behavior
Expected model training to proceed smoothly using 4 GPUs. When I run the same script with nproc_per_node=1 (or even 2), it runs fine, but setting it to 4 gives the errors above.
After updating PyTorch to 1.9.0, I get a different error:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:832, unhandled system error, NCCL version 2.7.8 ncclSystemError: System call (socket, malloc, munmap, etc) failed.
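To separate NCCL/environment problems from the training script itself, a minimal all-reduce test can be launched with the same torch.distributed.launch invocation. This is a sketch; the file name nccl_check.py is arbitrary, and NCCL_DEBUG=INFO is optional but prints more detail about where initialization fails:

```python
# nccl_check.py -- hypothetical file name; launch with:
#   NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node 4 nccl_check.py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")  # reads MASTER_ADDR/PORT, RANK, WORLD_SIZE from env

# One tensor per rank; after all_reduce every rank should hold the sum 0+1+2+3 = 6.
x = torch.tensor([float(dist.get_rank())], device="cuda")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")
```

If this small script reproduces the NCCL failure with --nproc_per_node 4, the problem is in the environment (shared memory, driver, container settings) rather than in run_glue.py.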
Top GitHub Comments
Thanks for the quick reply. Yeah, it’s strange that it works on 2 GPUs but not on 4. Will check again and let you know.
Hi @sgugger, good news: the issue seems to have been an environment problem. Thanks for the instant help.