Facing NCCL error on multi-GPU training (single machine) using the run_glue.py script
Environment info
- transformers version: 4.3.2
- Platform: Linux-4.19.0-14-cloud-amd64-x86_64-with-debian-buster-sid
- Python version: 3.7.9
- PyTorch version (GPU?): 1.7.0 (True)
- Tensorflow version (GPU?): 2.4.1 (True)
- Using GPU in script?: Yes, 4x Tesla T4 (GCP)
- Using distributed or parallel set-up in script?: torch.distributed.launch
Who can help
Information
Model I am using (Bert, XLNet …): DistilRoberta
The problem arises when using:
- [*] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)
The task I am working on is:
- [ ] an official GLUE/SQuAD task: (give the name)
- [*] my own task or dataset: (give details below)
Regression task with a single output, using BertForSequenceClassification
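For reference, a single-output regression head can be obtained simply by loading the sequence classification model with num_labels=1. The sketch below is illustrative only (it is not the exact code path of run_glue.py, which infers num_labels from the dataset); the model name matches the one used above, everything else is an example:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

# Illustrative sketch: single-output (regression-style) head on distilroberta-base.
config = AutoConfig.from_pretrained("distilroberta-base", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", config=config)

inputs = tokenizer("example sentence", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 1]) -> one continuous output per example
```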
To reproduce
Steps to reproduce the behavior:
1. python -m torch.distributed.launch --nproc_per_node 4 /home/run_glue.py --train_file /home/data/train.csv --validation_file /home/data/dev.csv --test_file /home/data/test.csv --model_name_or_path distilroberta-base --output_dir /home/model --num_train_epochs 5 --per_device_train_batch_size 1 --per_device_eval_batch_size 16 --do_train --do_eval --fp16 --gradient_accumulation_steps 2 --do_predict --logging_steps 100 --evaluation_strategy steps --save_steps 100 --overwrite_output_dir
This produces the following NCCL warnings and traceback:

732793de051f:1897:1927 [3] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
732793de051f:1897:1927 [3] NCCL INFO include/shm.h:41 -> 2
732793de051f:1897:1927 [3] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-b3d54cebe4167a34-0-2-3 (size 9637888)
732793de051f:1895:1925 [1] NCCL INFO transport/shm.cc:101 -> 2
732793de051f:1895:1925 [1] NCCL INFO transport.cc:30 -> 2
732793de051f:1895:1925 [1] NCCL INFO transport.cc:49 -> 2
732793de051f:1895:1925 [1] NCCL INFO init.cc:766 -> 2
732793de051f:1895:1925 [1] NCCL INFO init.cc:840 -> 2
732793de051f:1895:1925 [1] NCCL INFO group.cc:73 -> 2 [Async thread]

Traceback (most recent call last):
  File "/home/run_text_classification.py", line 480, in <module>
    main()
  File "/home/run_text_classification.py", line 163, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/opt/conda/lib/python3.7/site-packages/transformers/hf_argparser.py", line 180, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 60, in __init__
  File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 478, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and self.fp16:
  File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 1346, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/training_args.py", line 583, in device
    return self._setup_devices
  File "/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py", line 1336, in __get__
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729138878/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
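The posix_fallocate / shared-memory warnings typically point at /dev/shm being too small for NCCL's shared-memory transport, which is common inside Docker containers started with the default --shm-size. A quick check from Python (the interpretation in the comments is a suggestion, not an official requirement):

```python
import os

# Inspect the shared-memory filesystem NCCL uses for intra-node transport.
stats = os.statvfs("/dev/shm")
total_gib = stats.f_blocks * stats.f_frsize / 1024**3
free_gib = stats.f_bavail * stats.f_frsize / 1024**3
print(f"/dev/shm total: {total_gib:.2f} GiB, free: {free_gib:.2f} GiB")

# If this is tiny (e.g. Docker's 64 MiB default), enlarge it when starting the
# container (docker run --shm-size=1g ...) or, as a slower fallback, disable the
# shared-memory transport by setting the NCCL_SHM_DISABLE=1 environment variable.
```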
Expected behavior
Expected model training to proceed smoothly using 4 GPUs. When I run the same script with nproc_per_node=1 (or even 2), it runs fine, but setting it to 4 gives the errors above.
After updating PyTorch to 1.9.0, I get a different error:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:832, unhandled system error, NCCL version 2.7.8 ncclSystemError: System call (socket, malloc, munmap, etc) failed.
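To separate NCCL/environment problems from the training script itself, a minimal all-reduce test can be launched with the same torch.distributed.launch invocation. This is a sketch; the file name nccl_check.py is arbitrary, and NCCL_DEBUG=INFO is optional but prints more detail about where initialization fails:

```python
# nccl_check.py -- hypothetical file name; launch with:
#   NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node 4 nccl_check.py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")  # reads MASTER_ADDR/PORT, RANK, WORLD_SIZE from env

# One tensor per rank; after all_reduce every rank should hold the sum 0+1+2+3 = 6.
x = torch.tensor([float(dist.get_rank())], device="cuda")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")
```

If this small script reproduces the NCCL failure with --nproc_per_node 4, the problem is in the environment (shared memory, driver, container settings) rather than in run_glue.py.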
Top GitHub Comments
Thanks for the quick reply. Yeah, it’s strange that it works on 2 GPUs but not on 4. Will check again and let you know.
Hi @sgugger, good news: the issue seems to have been an environment problem. Thanks for the instant help.