[Bug][HPO] CUDA does not support fork, so you cannot run AutoGluon HPO if you've initialized CUDA
CUDA does not support the fork start method: see the PyTorch documentation on CUDA in multiprocessing (https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing). Thus, if we initialize CUDA before calling the AutoGluon scheduler, the forked trial processes fail with `CUDA: initialization error`.
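As a minimal demonstration of the underlying limitation, independent of AutoGluon, here is a sketch (assuming a machine with at least one visible GPU): a child process created with `fork` after the parent has touched CUDA fails, while a `spawn` child works.

```python
import multiprocessing as mp
import mxnet as mx

def touch_gpu():
    # Tries to (re-)initialize CUDA in the child process.
    print(mx.np.ones((2,), ctx=mx.gpu()))

if __name__ == '__main__':
    # Parent initializes CUDA first.
    a = mx.np.ones((2,), ctx=mx.gpu())
    for method in ('spawn', 'fork'):
        # 'spawn' starts a fresh interpreter and succeeds; 'fork'
        # inherits the parent's CUDA state and the child dies with
        # "CUDA: initialization error".
        p = mp.get_context(method).Process(target=touch_gpu)
        p.start()
        p.join()
        print(method, '-> exit code', p.exitcode)
```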
- Example 1 (runnable)

First install a compatible MXNet build:

```
pip install -U --pre "mxnet_cu100>=1.7.0b20200713, <2.0.0" -f https://sxjscience.github.io/KDD2020/
```

```python
import numpy as np
import autogluon as ag
import mxnet as mx
from mxnet.gluon import nn, Trainer
from mxnet.util import use_np


@use_np
class Net:
    def train_fn(self, args, reporter):
        gpu_ctx_l = [mx.gpu(i) for i in range(mx.context.num_gpus())]
        print('num_gpus:', len(gpu_ctx_l))
        np.random.seed(123)
        mx.random.seed(123)
        # Toy network trained to reconstruct its random input.
        net = nn.HybridSequential()
        net.add(nn.Dense(16))
        net.add(nn.Activation('relu'))
        net.add(nn.Dense(4))
        net.hybridize()
        net.initialize(ctx=gpu_ctx_l)
        trainer = Trainer(net.collect_params(), 'adam')
        for i in range(100):
            with mx.autograd.record():
                data = mx.np.random.normal(0, 1, (8, 4), ctx=gpu_ctx_l[0])
                out = net(data)
                loss = mx.np.square(out - data).sum()
            loss.backward()
            reporter(loss=loss.asnumpy().item(), iteration=i)
            trainer.step(1.0)


def run_tuning_jobs(fn, search_space):
    args_decorator = ag.args(**search_space)
    scheduler = ag.scheduler.FIFOScheduler(args_decorator(fn),
                                           resource={'num_cpus': 4, 'num_gpus': 1},
                                           num_trials=20,
                                           reward_attr='loss',
                                           time_attr='iteration')
    scheduler.run()
    scheduler.join_jobs()
    return scheduler


search_space = {
    'num_hidden': ag.space.Int(16, 32),
    'lr': ag.space.Real(1e-3, 1e-2)
}
net = Net()
# The parent process never touches the GPU before launching the
# scheduler, so the forked trial processes can initialize CUDA.
scheduler = run_tuning_jobs(net.train_fn, search_space)
```
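After `join_jobs()` returns, the scheduler object can report the outcome of the search; in the legacy AutoGluon scheduler API (as used in its HPO tutorials) that looks like:

```python
# Inspect the result of the tuning run.
print('Best config:', scheduler.get_best_config())
print('Best reward:', scheduler.get_best_reward())
```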
- Example 2 (raises an error)

Same script as Example 1, with a single line added before launching the tuning jobs. Creating an array on the GPU initializes CUDA in the parent process, so every forked trial then dies with `CUDA: initialization error`:

```python
net = Net()
# Added line: touching the GPU here initializes CUDA in the parent.
a = mx.np.ones((10,), ctx=mx.gpu())
scheduler = run_tuning_jobs(net.train_fn, search_space)
```
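Until the scheduler handles this, the practical workaround is to keep the parent process CUDA-free. If the parent needs information from the GPU (for example the device count), it can query it in a short-lived spawned subprocess instead of touching CUDA itself; `count_gpus_safely` below is an illustrative helper, not an AutoGluon API:

```python
import multiprocessing as mp

def _count_gpus(queue):
    # Runs in a fresh child process, so initializing CUDA here
    # does not poison the parent.
    import mxnet as mx
    queue.put(mx.context.num_gpus())

def count_gpus_safely():
    # Illustrative helper: query the GPU count without initializing
    # CUDA in the calling process.
    ctx = mp.get_context('spawn')
    queue = ctx.Queue()
    p = ctx.Process(target=_count_gpus, args=(queue,))
    p.start()
    p.join()
    return queue.get()

if __name__ == '__main__':
    print('GPUs visible:', count_gpus_safely())
```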
Also, @sxjscience pointed out that the reason tasks like ImageClassification get around this issue is that their `task.fit()` returns models to the CPU instead of the GPU, which seems undesirable… https://github.com/awslabs/autogluon/blob/5e1acab422289921ae9f7112e71855c2ea89e3b1/autogluon/task/image_classification/image_classification.py#L316
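That code path essentially moves the trained parameters back to the CPU before handing the model across the process boundary; in Gluon terms it amounts to something like the following paraphrase (the `to_cpu` helper is hypothetical, not an AutoGluon API):

```python
import mxnet as mx

def to_cpu(model):
    # Paraphrase of the linked workaround: move all parameters back
    # to the CPU before returning the model from the worker process.
    model.collect_params().reset_ctx(mx.cpu())
    return model
```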
For Ray/Tune this workaround is not needed: Ray Tune does not seem to hit the issue even when we reuse the global variable, i.e. the equivalent Ray Tune program runs fine after CUDA has been initialized in the driver. cc @zhreshold @szha @jwmueller @Innixma @Jerryzcn
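The Ray Tune snippet itself is not preserved in this excerpt, but an equivalent program would look roughly like the following sketch (using Ray Tune's legacy function API, `tune.run`/`tune.report`; version details are assumptions):

```python
import mxnet as mx
from mxnet.gluon import nn, Trainer
from mxnet.util import use_np
from ray import tune


@use_np
def train_fn(config):
    # Ray launches trials in fresh worker processes instead of forking
    # the driver, so CUDA state in the driver does not poison them.
    ctx = mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu()
    net = nn.HybridSequential()
    net.add(nn.Dense(config['num_hidden']))
    net.add(nn.Activation('relu'))
    net.add(nn.Dense(4))
    net.hybridize()
    net.initialize(ctx=ctx)
    trainer = Trainer(net.collect_params(), 'adam',
                      {'learning_rate': config['lr']})
    for i in range(100):
        with mx.autograd.record():
            data = mx.np.random.normal(0, 1, (8, 4), ctx=ctx)
            out = net(data)
            loss = mx.np.square(out - data).sum()
        loss.backward()
        trainer.step(1.0)
        tune.report(loss=loss.asnumpy().item(), iteration=i)


# Unlike the AutoGluon example above, initializing CUDA in the
# driver first does not break the trials.
a = mx.np.ones((10,), ctx=mx.gpu())
analysis = tune.run(
    train_fn,
    config={
        'num_hidden': tune.randint(16, 33),  # upper bound is exclusive
        'lr': tune.uniform(1e-3, 1e-2),
    },
    resources_per_trial={'cpu': 4, 'gpu': 1},
    num_samples=20)
```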