
Facing NCCL error on multi-GPU training (on a single machine) using the run_glue.py script

See original GitHub issue

Environment info

  • transformers version: 4.3.2
  • Platform: Linux-4.19.0-14-cloud-amd64-x86_64-with-debian-buster-sid
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.7.0 (True)
  • Tensorflow version (GPU?): 2.4.1 (True)
  • Using GPU in script?: 4xTesla T4 (GCP)
  • Using distributed or parallel set-up in script?: torch.distributed.launch

Who can help

Information

Model I am using (Bert, XLNet …): DistilRoberta

The problem arises when using:

  • [*] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] an official GLUE/SQUaD task: (give the name)
  • [*] my own task or dataset: (give details below)

Regression task with a single output, using BertForSequenceClassification

To reproduce

Steps to reproduce the behavior:

1. python -m torch.distributed.launch --nproc_per_node 4 /home/run_glue.py --train_file /home/data/train.csv --validation_file /home/data/dev.csv --test_file /home/data/test.csv --model_name_or_path distilroberta-base --output_dir /home/model --num_train_epochs 5 --per_device_train_batch_size 1 --per_device_eval_batch_size 16 --do_train --do_eval --fp16 --gradient_accumulation_steps 2 --do_predict --logging_steps 100 --evaluation_strategy steps --save_steps 100 --overwrite_output_dir

NCCL debug output (rank 1, process 1895):

732793de051f:1895:1925 [1] NCCL INFO transport/shm.cc:101 -> 2
732793de051f:1895:1925 [1] NCCL INFO transport.cc:30 -> 2
732793de051f:1895:1925 [1] NCCL INFO transport.cc:49 -> 2
732793de051f:1895:1925 [1] NCCL INFO init.cc:766 -> 2
732793de051f:1895:1925 [1] NCCL INFO init.cc:840 -> 2
732793de051f:1895:1925 [1] NCCL INFO group.cc:73 -> 2 [Async thread]

NCCL debug output (rank 3, process 1897):

732793de051f:1897:1927 [3] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
732793de051f:1897:1927 [3] NCCL INFO include/shm.h:41 -> 2
732793de051f:1897:1927 [3] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-b3d54cebe4167a34-0-2-3 (size 9637888)

Python traceback (frames untangled from the interleaved multi-process output):

Traceback (most recent call last):
  File "/home/run_text_classification.py", line 480, in <module>
    main()
  File "/home/run_text_classification.py", line 163, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/opt/conda/lib/python3.7/site-packages/transformers/hf_argparser.py", line 180, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 60, in __init__
  File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 478, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and self.fp16:
  File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 1346, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 583, in device
    return self._setup_devices
  File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 1336, in __get__
  ... (intermediate frames not captured in the paste)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729138878/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
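
The posix_fallocate "No space left on device" warning refers to the shared-memory segments NCCL creates under /dev/shm. The host name 732793de051f looks like a Docker container ID, and containers often default to a 64 MB /dev/shm, which can be too small once four ranks open shared-memory transports. As a rough diagnostic (this is an editor sketch, not from the original issue; the 1 GiB threshold is only an illustrative guess), a few lines of Python report how much shared memory is actually available before launching:

    # check /dev/shm capacity before a multi-GPU launch (Linux only)
    import shutil

    usage = shutil.disk_usage("/dev/shm")
    total_gib = usage.total / 2**30
    free_gib = usage.free / 2**30
    print(f"/dev/shm: total {total_gib:.2f} GiB, free {free_gib:.2f} GiB")

    # illustrative threshold, not an official NCCL requirement
    if free_gib < 1.0:
        print("Warning: /dev/shm looks small; NCCL's shared-memory transport may fail "
              "with 'No space left on device'.")

If the container's /dev/shm really is tiny, the usual remedies are to enlarge it (for Docker, --shm-size=1g on docker run) or to bypass the shared-memory transport with the documented NCCL_SHM_DISABLE=1 environment variable; NCCL_DEBUG=INFO makes NCCL log which transport it selects.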

Expected behavior

Expected model training to proceed smoothly using 4x GPU. When I run the said script with nproc_per_node=1 (or even 2), it runs smoothly, but setting it to 4 gives the strange errors above.

After updating to PyTorch 1.9.0, I face a different error:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:832, unhandled system error, NCCL version 2.7.8 ncclSystemError: System call (socket, malloc, munmap, etc) failed.
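
Because both tracebacks die inside torch.distributed's init_process_group()/barrier() before any transformers code runs, it can help to reproduce the failure outside the example script. The sketch below is an editor illustration (the file name nccl_check.py is hypothetical, not part of the issue): it only initializes the NCCL process group and issues a single barrier, using the same torch.distributed.launch invocation as run_glue.py. If it also fails at --nproc_per_node 4, the problem is in the environment rather than in the script, which matches how this issue was eventually resolved.

    # nccl_check.py (hypothetical): minimal NCCL init + barrier, launched e.g. with
    #   python -m torch.distributed.launch --nproc_per_node 4 nccl_check.py
    import argparse

    import torch
    import torch.distributed as dist

    def main():
        parser = argparse.ArgumentParser()
        # torch.distributed.launch passes --local_rank to every worker process
        parser.add_argument("--local_rank", type=int, default=0)
        args = parser.parse_args()

        torch.cuda.set_device(args.local_rank)
        dist.init_process_group(backend="nccl")  # the call that raises the NCCL error above
        dist.barrier()
        print(f"rank {dist.get_rank()} of {dist.get_world_size()} passed the barrier")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Running it with NCCL_DEBUG=INFO set prints the transport decisions, which is where the shared-memory warnings above come from.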

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

2 reactions
aditya-malte commented, Mar 2, 2021

Thanks for the quick reply. Yeah, it’s strange that it works on 2 GPUs but not on 4. Will check again and let you know.

1 reaction
aditya-malte commented, Mar 4, 2021

Hi @sgugger, good news: the issue seems to have been an environment issue. Thanks for the instant help.

Read more comments on GitHub >

Top Results From Across the Web

  • How to solve the famous `unhandled cuda error, NCCL ...`
    I had the right cuda installed, meaning: python -c "import torch;print(torch.version.cuda)" was equal to nvcc -V. (A version-check sketch follows this list.)

  • Evaluate doesn't play nicely with Accelerate in multi-GPU ...
    accelerate launch --num_processes 2 --num_machines 1 --multi_gpu repro.py crashes with an error such as: RuntimeError: NCCL error in: ...

  • Massively Scale Your Deep Learning Training with NCCL 2.4
    Using multiple GPUs to train neural networks has become quite common with all deep learning frameworks, providing optimized, multi-GPU ...

  • Distributed data parallel freezes without error message
    Hello, I'm trying to use distributed data parallel to train a resnet model on multiple GPUs on multiple nodes. The script is ...

  • DistributedDataParallel init hanging - fastai
    I am trying to do single node multi-gpu (4 gpus) training with ... This error is not fastai related, but there might be ...
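
The first result above amounts to checking that the CUDA toolkit PyTorch was built against matches what the machine provides. A short version-check sketch (an editor addition; compare the output against nvcc -V and nvidia-smi on the host):

    # print the toolchain versions PyTorch was built with
    import torch

    print("torch:", torch.__version__)
    print("built with CUDA:", torch.version.cuda)
    print("cuDNN:", torch.backends.cudnn.version())
    print("NCCL:", torch.cuda.nccl.version())   # e.g. 2708 corresponds to NCCL 2.7.8
    print("visible GPUs:", torch.cuda.device_count())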
