
CUDA initialization

See original GitHub issue

System Info

Hello everybody. I keep encountering the same issue. I am using PyTorch '1.12.1+cu102' and fastai '2.7.9'.
I need to use the multiple GPUs on our server to train deeper networks with more images.
___
accelerate env

Traceback (most recent call last):
  File "/home/andrea/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/env.py", line 34, in env_command
    accelerate_config = load_config_from_file(args.config_file).to_dict()
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 63, in load_config_from_file
    return config_class.from_yaml_file(yaml_file=config_file)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 116, in from_yaml_file
    return cls(**config_dict)
TypeError: __init__() got an unexpected keyword argument 'command_file'
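
As an aside, this `accelerate env` failure typically means the config file was written by a different accelerate version, so the installed one rejects the `command_file` key. One plausible fix, an assumption rather than something confirmed in this thread, is to delete the stale default config and regenerate it with `accelerate config`:

import os

# Assumption: no custom --config_file is in use, so accelerate reads its
# default config location. Removing the stale file and rerunning
# `accelerate config` writes a fresh, compatible one.
cfg = os.path.expanduser("~/.cache/huggingface/accelerate/default_config.yaml")
if os.path.exists(cfg):
    os.remove(cfg)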

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Here is the script that I am using:


from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

from accelerate import Accelerator
from accelerate.utils import set_seed
from timm import create_model
from accelerate import notebook_launcher

def get_msk(o):
    return path_Rflbl + f'/RfM_{o.stem}{o.suffix.lower()}___fuse{o.suffix.lower()}'

numeral_codes = [i for i in range(0, 16)]  # as I am labeling 16 categories in the data
print('numeral codes ', numeral_codes)

file = open(path + '/codes.txt', 'w+')

# Saving the array in a text file
content = str(numeral_codes)
file.write(content)
file.close()

def train():
    dls = SegmentationDataLoaders.from_label_func(
        path, bs=8,
        fnames=get_image_files(path + '/Impng'),
        label_func=get_msk,
        codes=np.loadtxt(path + '/codes.txt', dtype=str),
    )
    learn = unet_learner(dls, resnet34)
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)

notebook_launcher(train, num_processes=4)


It all works until I call notebook_launcher; then it comes up with:

ValueError                                Traceback (most recent call last)
Input In [46], in <cell line: 24>()
     19 with learn.distrib_ctx(in_notebook=True, sync_bn=False):
     20     learn.fit(10)
---> 24 notebook_launcher(train, num_processes=4)

File ~/anaconda3/lib/python3.9/site-packages/accelerate/launchers.py:102, in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
     95     raise ValueError(
     96         "To launch a multi-GPU training from your notebook, the Accelerator should only be initialized "
     97         "inside your training function. Restart your notebook and make sure no cells initializes an "
     98         "Accelerator."
     99     )
    101 if torch.cuda.is_initialized():
--> 102     raise ValueError(
    103         "To launch a multi-GPU training from your notebook, you need to avoid running any instruction "
    104         "using torch.cuda in any cell. Restart your notebook and make sure no cells use any CUDA "
    105         "function."
    106     )
    108 try:
    109     mixed_precision = PrecisionType(mixed_precision.lower())

ValueError: To launch a multi-GPU training from your notebook, you need to avoid running any instruction using torch.cuda in any cell. Restart your notebook and make sure no cells use any CUDA function.


Yet I have no CUDA instructions anywhere in my notebook. And I need the notebook launcher in order to train on multiple GPUs (I would have 6).

Do you have any ideas? Do I need to update some version of something?
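
A diagnostic sketch, not part of the original report: after restarting the kernel, the snippet below shows whether some earlier cell has already touched CUDA; notebook_launcher raises the ValueError above whenever this is already True at launch time. Running it after each cell narrows down the culprit.

import torch

# Becomes True as soon as any cell initializes CUDA, e.g. by moving a tensor
# or model to the GPU, or by querying device properties.
print(torch.cuda.is_initialized())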

Expected behavior

If, instead of my data, I use
path = untar_data(URLs.CAMVID_TINY)

I can train on up to 4 GPUs, and also using xresnet50. The processes seem to run on 4 independent GPUs, but I am not yet sure that each one handles a chunk of the total workload and that the calculation executes in parallel as I intended. For instance, I am not sure that the memory available for the whole calculation is the sum of the GPUs' memory.
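
For context, notebook_launcher starts one process per GPU and trains with PyTorch's DistributedDataParallel: each process holds a full replica of the model on its own GPU and sees a different shard of every batch, so GPU memory is replicated rather than pooled. A minimal sketch (the check function is hypothetical, not from this issue) that prints the per-process layout:

from accelerate import Accelerator, notebook_launcher

def check():
    # Each launched process builds its own Accelerator, which reports its
    # rank, the total process count, and the device it was assigned.
    acc = Accelerator()
    print(f"process {acc.process_index} of {acc.num_processes} on {acc.device}")

notebook_launcher(check, num_processes=4)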

Anyhow, could you please help me in executing this calculation on multiple GPUs?

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Comments: 20 (1 by maintainers)

Top GitHub Comments

3 reactions
Afera672 commented, Dec 21, 2022

[Screenshot: "Screenshot 2022-12-21 at 14 24 09"]
0 reactions
Afera672 commented, Dec 21, 2022

@muellerzr But if I start the calculation on more than 2 GPUs, it crashes with out-of-memory errors:

[Screenshot: "Screenshot 2022-12-21 at 14 34 20"]

Now, the reason I want to use many GPUs is exactly to avoid this sort of error. Do you have any idea how I could manage the memory, and/or ask accelerate to do it for us? We plan to train on MANY images and to use at least ResNet50, while for now I am confined to ResNet34. Which is not bad, but... Thank you for your time!
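
Since each DDP process keeps a full model replica, adding GPUs does not add usable memory per process; what helps is shrinking the per-GPU footprint. A hedged sketch of the usual levers, reusing the names from the reproduction script above (bs=4 is an illustrative value, and to_fp16 is fastai's mixed-precision wrapper):

def train():
    dls = SegmentationDataLoaders.from_label_func(
        path, bs=4,                      # smaller per-GPU batch size
        fnames=get_image_files(path + '/Impng'),
        label_func=get_msk,
        codes=np.loadtxt(path + '/codes.txt', dtype=str),
    )
    # Mixed precision roughly halves activation memory, which is what
    # usually allows stepping up from resnet34 to resnet50.
    learn = unet_learner(dls, resnet34).to_fp16()
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)

notebook_launcher(train, num_processes=4)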

Read more comments on GitHub >
