CUDA initialization
System Info
Hello everybody. I keep encountering the same issue. I am using PyTorch '1.12.1+cu102' and fastai '2.7.9'.
I need to use multiple GPUs on our server to train deeper networks with more images.
___
Running `accelerate env` gives:

```
Traceback (most recent call last):
  File "/home/andrea/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/env.py", line 34, in env_command
    accelerate_config = load_config_from_file(args.config_file).to_dict()
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 63, in load_config_from_file
    return config_class.from_yaml_file(yaml_file=config_file)
  File "/home/andrea/anaconda3/lib/python3.9/site-packages/accelerate/commands/config/config_args.py", line 116, in from_yaml_file
    return cls(**config_dict)
TypeError: __init__() got an unexpected keyword argument 'command_file'
```
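That `TypeError` usually means the saved Accelerate config file was written by a different version of Accelerate and contains a key (`command_file`) that the installed version no longer accepts. Re-running `accelerate config` rewrites the file; alternatively, here is a minimal sketch that strips the stale key in place, assuming the config sits at Accelerate's default path and that PyYAML is available:

```python
# A sketch, not a verified fix: drop the leftover 'command_file' key from the
# saved Accelerate config so `accelerate env` can parse it again.
# Assumes the default config location; adjust config_path if yours differs.
from pathlib import Path
import yaml

config_path = Path.home() / ".cache/huggingface/accelerate/default_config.yaml"
config = yaml.safe_load(config_path.read_text())
config.pop("command_file", None)  # key written by an older accelerate version
config_path.write_text(yaml.safe_dump(config))
```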
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate, or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- My own task or dataset (give details below)
Reproduction
Here is the script that I am using:
```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *
from accelerate import Accelerator, notebook_launcher
from accelerate.utils import set_seed
from timm import create_model

def get_msk(o):
    return path_Rflbl + fr'/RfM_{o.stem}{o.suffix.lower()}___fuse{o.suffix.lower()}'

numeral_codes = [i for i in range(0, 16)]  # as I am labeling 16 categories in the data
print('numeral codes ', numeral_codes)

file = open(path + '/codes.txt', "w+")
# Saving the array in a text file
content = str(numeral_codes)
file.write(content)
file.close()

def train():
    dls = SegmentationDataLoaders.from_label_func(
        path,
        bs=8,
        fnames=get_image_files(path + '/Impng'),
        label_func=get_msk,
        codes=np.loadtxt(path + '/codes.txt', dtype=str),
    )
    learn = unet_learner(dls, resnet34)
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)

notebook_launcher(train, num_processes=4)
```
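One aside on the script: `str(numeral_codes)` writes the whole list on a single line, like `[0, 1, 2, ...]`, so `np.loadtxt(..., dtype=str)` will return tokens such as `'[0,'` and `'1,'` rather than clean class codes. A minimal round-trippable alternative (one code per line, using the same `path` as the script above):

```python
import numpy as np

numeral_codes = list(range(16))                            # 16 categories
np.savetxt(path + '/codes.txt', numeral_codes, fmt='%d')   # one code per line
codes = np.loadtxt(path + '/codes.txt', dtype=str)         # ['0', '1', ..., '15']
```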
It all works until I call notebook_launcher; then it fails with:
```
ValueError                                Traceback (most recent call last)
Input In [46], in <cell line: 24>()
     19 with learn.distrib_ctx(in_notebook=True, sync_bn=False):
     20     learn.fit(10)
---> 24 notebook_launcher(train, num_processes=4)

File ~/anaconda3/lib/python3.9/site-packages/accelerate/launchers.py:102, in notebook_launcher(function, args, num_processes, use_fp16, mixed_precision, use_port)
     95     raise ValueError(
     96         "To launch a multi-GPU training from your notebook, the `Accelerator` should only be initialized "
     97         "inside your training function. Restart your notebook and make sure no cells initializes an "
     98         "`Accelerator`."
     99     )
    101 if torch.cuda.is_initialized():
--> 102     raise ValueError(
    103         "To launch a multi-GPU training from your notebook, you need to avoid running any instruction "
    104         "using `torch.cuda` in any cell. Restart your notebook and make sure no cells use any CUDA "
    105         "function."
    106     )
    108 try:
    109     mixed_precision = PrecisionType(mixed_precision.lower())

ValueError: To launch a multi-GPU training from your notebook, you need to avoid running any instruction using `torch.cuda` in any cell. Restart your notebook and make sure no cells use any CUDA function.
```
Yet, I have no CUDA instructions. And I need the notebook launcher in order to train on multiple GPUs (I would have 6).
Do you have any ideas? Do I need to update some version of something?
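One way to verify the "no CUDA" claim: in a fresh cell, immediately before calling notebook_launcher, check whether anything has already initialized CUDA. If this prints True, some earlier cell (or an import side effect) has touched the GPU, which is exactly the condition the launcher rejects:

```python
# Diagnostic only: True means some earlier cell already initialized CUDA.
import torch
print(torch.cuda.is_initialized())
```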
Expected behavior
If, instead of my data, I use `path = untar_data(URLs.CAMVID_TINY)`, I can train on up to 4 GPUs, and also using xresnet50. The processes seem to run on 4 independent GPUs, but I am not yet sure that each one works on a chunk of the total and that the calculation executes in parallel as I intended. For instance, I am not sure that the memory available for the whole calculation is the sum of the GPUs' memory.
Anyhow, could you please help me execute this calculation on multiple GPUs?
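To check what each worker is doing, one can print which device it lands on from inside train(). A minimal sketch follows; report_worker is a hypothetical helper, and LOCAL_RANK is the per-process rank variable that the launcher sets, as far as I understand. Note that under distributed data parallelism each process holds a full copy of the model on its own GPU and trains on its own shard of each batch, so the GPUs' memory is not summed into one pool:

```python
import os
import torch

def report_worker():
    # Hypothetical helper: call inside train(), e.g. right before learn.fit().
    # Each spawned process should report a different rank and device.
    rank = int(os.environ.get("LOCAL_RANK", 0))
    print(f"rank {rank} -> cuda:{torch.cuda.current_device()} "
          f"({torch.cuda.get_device_name()})")
```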
Issue Analytics
- Created 9 months ago
- Comments: 20 (1 by maintainers)
@muellerzr But if I start the calculation on more than 2 GPUs, it crashes with out-of-memory errors.
Now, the reason why I want to use many GPUs is exactly to avoid this sort of error. Do you have any idea how I could manage the memory, and/or ask Accelerate to do it for us? We plan to train on MANY images and to use at least a ResNet50, while for now I am confined to ResNet34.
Which is not bad, but…
Thank you for your time!
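For reference, a minimal sketch of the usual levers when each DDP worker runs out of memory: a smaller per-GPU batch size, gradient accumulation, and mixed precision. Adding GPUs raises throughput but not per-GPU capacity, since every process must fit the whole model plus its own batch. This reuses train() from the script above; GradientAccumulation and to_fp16 are standard fastai features, though the exact numbers here are only illustrative:

```python
def train():
    dls = SegmentationDataLoaders.from_label_func(
        path,
        bs=2,                                  # smaller per-GPU batch
        fnames=get_image_files(path + '/Impng'),
        label_func=get_msk,
        codes=np.loadtxt(path + '/codes.txt', dtype=str),
    )
    learn = unet_learner(
        dls, resnet34,
        cbs=GradientAccumulation(n_acc=8),     # effective batch of 2*8 per GPU
    ).to_fp16()                                # mixed precision cuts activation memory
    with learn.distrib_ctx(in_notebook=True, sync_bn=False):
        learn.fit(10)

notebook_launcher(train, num_processes=4)
```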