
Hi,

I created the environment from the YAML file and installed torch 1.8 following the settings in the README.

My CUDA version is 11.4, and it looks like there is a version conflict between NCCL, PyTorch, and CUDA.

Is my CUDA version too high?

ssh://xh@210.28.134.34:22/home2/xh/.conda/envs/skg/bin/python -u -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_prefix_compwebq.cfg --run_name T5_base_prefix_compwebq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 2 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_prefix_compwebq --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 4 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
INFO:filelock:Lock 140211123887800 acquired on .lock
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
INFO:filelock:Lock 140211123887800 released on .lock
INFO:filelock:Lock 140144150953768 acquired on .lock
INFO:filelock:Lock 140144150953768 released on .lock
INFO:filelock:Lock 139898741587640 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 139898741587640 released on .lock
INFO:filelock:Lock 139711655354096 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 139711655354096 released on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Killing subprocess 30316
Killing subprocess 30320
Killing subprocess 30321
Killing subprocess 30322
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home2/xh/.conda/envs/skg/bin/python', '-u', 'train.py', '--local_rank=3', '--seed', '2', '--cfg', 'Salesforce/T5_base_prefix_compwebq.cfg', '--run_name', 'T5_base_prefix_compwebq', '--logging_strategy', 'steps', '--logging_first_step', 'true', '--logging_steps', '4', '--evaluation_strategy', 'steps', '--eval_steps', '500', '--metric_for_best_model', 'avr', '--greater_is_better', 'true', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '1', '--load_best_model_at_end', '--gradient_accumulation_steps', '2', '--num_train_epochs', '400', '--adafactor', 'true', '--learning_rate', '5e-5', '--do_train', '--do_eval', '--do_predict', '--predict_with_generate', '--output_dir', 'output/T5_base_prefix_compwebq', '--overwrite_output_dir', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '4', '--generation_num_beams', '4', '--generation_max_length', '128', '--input_max_length', '1024', '--ddp_find_unused_parameters', 'true']' returned non-zero exit status 1.

Process finished with exit code 1
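
The RuntimeError above ("ProcessGroupNCCL is only supported with GPUs, no GPUs found!") means this PyTorch build cannot see any CUDA device at all, so before blaming NCCL it is worth checking what torch itself reports. A minimal sanity check, assuming only standard PyTorch APIs (no project code):

import torch

# Which CUDA toolkit this wheel was built against; None means a CPU-only build
print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)

# If this prints False, init_process_group(backend="nccl") fails exactly as above
print("CUDA available:", torch.cuda.is_available(),
      "| visible GPUs:", torch.cuda.device_count())

if torch.cuda.is_available():
    # NCCL version bundled with this PyTorch build
    print("NCCL:", torch.cuda.nccl.version())

If CUDA shows up as unavailable here, the likely culprit is a CPU-only or mismatched wheel (or a driver / CUDA_VISIBLE_DEVICES issue) rather than CUDA 11.4 being "too high"; within CUDA 11.x, wheels built for an earlier 11.x toolkit generally run fine on a newer driver.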

I also tried torch 1.11 + cu113 and got another error:


(skg) xh@4210GPU:~/PycharmProject/UnifiedSKG$ python -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_finetune_compwebq.cfg --run_name T5_base_finetune_compwebq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 2 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_compwebq --overwrite_output_dir --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17913) of binary: /home2/xh/.conda/envs/skg/bin/python
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 17914)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 17915)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 17916)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 17913)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Looking forward to your reply. Thank you.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
ChenWu98 commented, Mar 16, 2022

I think it is a PyTorch version issue. My personal experience is that removing the following lines works for other PyTorch versions. https://github.com/HKUNLP/UnifiedSKG/blob/fab45fea3a349c9dbda4ed34482df227920272db/train.py#L27-L29
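
For reference, a minimal sketch of a version-tolerant alternative to deleting those lines outright, assuming (as the traceback above shows) that the failing call is torch.set_deterministic(True); this is not the repository's code, just one possible workaround:

import torch

def enable_determinism(enabled: bool = True) -> None:
    # torch.use_deterministic_algorithms is the replacement for
    # torch.set_deterministic, which newer PyTorch releases removed
    if hasattr(torch, "use_deterministic_algorithms"):
        torch.use_deterministic_algorithms(enabled)
    elif hasattr(torch, "set_deterministic"):
        torch.set_deterministic(enabled)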

1 reaction
Timothyxxx commented, Mar 14, 2022

Indeed, we met a similar situation during our experiments on some machines (we used a lot of GPUs on different kinds of HPC). I remember we fixed that issue by installing the PyTorch build that matches your CUDA version.
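
Once a build that matches the local CUDA setup is installed, a quick check (standard torch APIs only) that the NCCL backend used by torch.distributed.launch is actually usable:

import torch
import torch.distributed as dist

# Both must be True for init_process_group(backend="nccl") to succeed
print("CUDA visible to torch:  ", torch.cuda.is_available())
print("NCCL backend available: ", dist.is_nccl_available())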


Top Results From Across the Web

Release Notes :: NVIDIA Deep Learning NCCL Documentation
NCCL Release Notes. This document describes the key features, software enhancements and improvements, and known issues for NCCL 2.16.2.

How to check the version of NCCL - Stack Overflow
you can do python -c "import torch;print(torch.cuda.nccl.version())" with pytorch. I wish I knew the terminal command without pytorch.

NVIDIA/nccl: Optimized primitives for collective multi ... - GitHub
NCCL (pronounced "Nickel") is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce- ...

Nccl - :: Anaconda.org
Description. The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are ...

Horovod on GPU
In most situations, using NCCL 2 will significantly improve performance over the CPU version. NCCL 2 provides the allreduce operation optimized for NVIDIA ...
