
CUDA initialization

See original GitHub issue

System Info

Hello everybody. I keep encountering the same issue. I am using PyTorch '1.12.1+cu102' and fastai '2.7.9'.
I need to use the multiple GPUs on our server to train deeper networks with more images.
___
accelerate env

Traceback (most recent call last):
  File "/home/andrea/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/env.py", line 34, in env_command
    accelerate_config = load_config_from_file(args.config_file).to_dict()
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 63, in load_config_from_file
    return config_class.from_yaml_file(yaml_file=config_file)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 116, in from_yaml_file
    return cls(**config_dict)
TypeError: __init__() got an unexpected keyword argument 'command_file'
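
As an aside, this `accelerate env` failure typically means the config file was written by a different accelerate version, so the installed one rejects the `command_file` key. One plausible fix, an assumption rather than something confirmed in this thread, is to delete the stale default config and regenerate it with `accelerate config`:

import os

# Assumption: no custom --config_file is in use, so accelerate reads its
# default config location. Removing the stale file and rerunning
# `accelerate config` writes a fresh, compatible one.
cfg = os.path.expanduser("~/.cache/huggingface/accelerate/default_config.yaml")
if os.path.exists(cfg):
    os.remove(cfg)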

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Here is the script that I am using:


from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

from accelerate import Accelerator
from accelerate.utils import set_seed
from timm import create_model
from accelerate import notebook_launcher

def get_msk(o):
    return path_Rflbl + f'/RfM_{o.stem}{o.suffix.lower()}___fuse{o.suffix.lower()}'

numeral_codes = [i for i in range(0, 16)]  # as I am labeling 16 categories in the data
print('numeral codes ', numeral_codes)

file = open(path + '/codes.txt', 'w+')

# Saving the array in a text file
content = str(numeral_codes)
file.write(content)
file.close()

def train():
    dls = SegmentationDataLoaders.from_label_func(
        path, bs=8,
        fnames=get_image_files(path + '/Impng'),
        label_func=get_msk,
        codes=np.loadtxt(path + '/codes.txt', dtype=str),
    )
    learn = unet_learner(dls, resnet34)
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)

notebook_launcher(train, num_processes=4)


It all works until I call notebook_launcher; then it comes up with:

ValueError                                Traceback (most recent call last)
Input In [46], in <cell line: 24>()
     19 with learn.distrib_ctx(in_notebook=True, sync_bn=False):
     20     learn.fit(10)
---> 24 notebook_launcher(train, num_processes=4)

File ~/anaconda3/lib/python3.9/site-packages/accelerate/launchers.py:102, in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
     95     raise ValueError(
     96         "To launch a multi-GPU training from your notebook, the Accelerator should only be initialized "
     97         "inside your training function. Restart your notebook and make sure no cells initializes an "
     98         "Accelerator."
     99     )
    101 if torch.cuda.is_initialized():
--> 102     raise ValueError(
    103         "To launch a multi-GPU training from your notebook, you need to avoid running any instruction "
    104         "using torch.cuda in any cell. Restart your notebook and make sure no cells use any CUDA "
    105         "function."
    106     )
    108 try:
    109     mixed_precision = PrecisionType(mixed_precision.lower())

ValueError: To launch a multi-GPU training from your notebook, you need to avoid running any instruction using torch.cuda in any cell. Restart your notebook and make sure no cells use any CUDA function.


Yet I have no CUDA instructions anywhere in my notebook. And I need the notebook launcher in order to train on multiple GPUs (I would have 6).

Do you have any ideas? Do I need to update some version of something?
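
A diagnostic sketch, not part of the original report: after restarting the kernel, the snippet below shows whether some earlier cell has already touched CUDA; notebook_launcher raises the ValueError above whenever this is already True at launch time. Running it after each cell narrows down the culprit.

import torch

# Becomes True as soon as any cell initializes CUDA, e.g. by moving a tensor
# or model to the GPU, or by querying device properties.
print(torch.cuda.is_initialized())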

Expected behavior

If, instead of my data, I use
path = untar_data(URLs.CAMVID_TINY)

I can train on up to 4 GPUs, and also using xresnet50. The processes seem to run on 4 independent GPUs, but I am not yet sure that each one handles a chunk of the total workload and that the calculation executes in parallel as I intended. For instance, I am not sure that the memory available for the whole calculation is the sum of the GPUs' memory.
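
For context, notebook_launcher starts one process per GPU and trains with PyTorch's DistributedDataParallel: each process holds a full replica of the model on its own GPU and sees a different shard of every batch, so GPU memory is replicated rather than pooled. A minimal sketch (the check function is hypothetical, not from this issue) that prints the per-process layout:

from accelerate import Accelerator, notebook_launcher

def check():
    # Each launched process builds its own Accelerator, which reports its
    # rank, the total process count, and the device it was assigned.
    acc = Accelerator()
    print(f"process {acc.process_index} of {acc.num_processes} on {acc.device}")

notebook_launcher(check, num_processes=4)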

Anyhow, could you please help me in executing this calculation on multiple GPUs?

Issue Analytics

  • State: open
  • Created: 9 months ago
  • Comments: 20 (1 by maintainers)

Top GitHub Comments

3 reactions
Afera672 commented, Dec 21, 2022

[Screenshot: "Screenshot 2022-12-21 at 14 24 09"]
0 reactions
Afera672 commented, Dec 21, 2022

@muellerzr But if I start the calculation on more than 2 GPUs, it crashes with out-of-memory errors:

[Screenshot: "Screenshot 2022-12-21 at 14 34 20"]

Now, the reason I want to use many GPUs is exactly to avoid this sort of error. Do you have any idea how I could manage the memory, and/or ask accelerate to do it for us? We plan to train on MANY images and to use at least ResNet50, while for now I am confined to ResNet34. Which is not bad, but... Thank you for your time!
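
Since each DDP process keeps a full model replica, adding GPUs does not add usable memory per process; what helps is shrinking the per-GPU footprint. A hedged sketch of the usual levers, reusing the names from the reproduction script above (bs=4 is an illustrative value, and to_fp16 is fastai's mixed-precision wrapper):

def train():
    dls = SegmentationDataLoaders.from_label_func(
        path, bs=4,                      # smaller per-GPU batch size
        fnames=get_image_files(path + '/Impng'),
        label_func=get_msk,
        codes=np.loadtxt(path + '/codes.txt', dtype=str),
    )
    # Mixed precision roughly halves activation memory, which is what
    # usually allows stepping up from resnet34 to resnet50.
    learn = unet_learner(dls, resnet34).to_fp16()
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)

notebook_launcher(train, num_processes=4)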

Read more comments on GitHub >
