[Bug][HPO] CUDA does not support fork so you cannot AutoGluon HPO if you've initialized CUDA

CUDA does not support fork: see the PyTorch documentation (https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing). Thus, if we initialize CUDA before calling the AutoGluon scheduler, we will see an error: CUDA: initialization error.
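
The underlying limitation is in the process start method itself: once a parent process has created a CUDA context, a child created with the default fork start method inherits an unusable copy of that context and fails on its first GPU call, whereas a child created with the spawn start method starts with a clean CUDA state. Below is a minimal sketch, not taken from the issue, that illustrates the rule with plain Python multiprocessing; it assumes an MXNet CUDA build and at least one visible GPU.

import multiprocessing as mp


def child_uses_gpu():
    # The child's first GPU touch happens here, inside its own process.
    import mxnet as mx
    print(mx.np.ones((2,), ctx=mx.gpu()).asnumpy())


if __name__ == '__main__':
    import mxnet as mx
    _ = mx.np.ones((2,), ctx=mx.gpu())  # the parent initializes CUDA here
    # With the 'fork' start method the child would hit "CUDA: initialization error";
    # 'spawn' gives it a fresh interpreter and therefore a fresh CUDA context.
    ctx = mp.get_context('spawn')
    p = ctx.Process(target=child_uses_gpu)
    p.start()
    p.join()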

  • Example 1 (runnable)

Install the required MXNet build first (shell command, not part of the script):

pip install -U --pre "mxnet_cu100>=1.7.0b20200713, <2.0.0" -f https://sxjscience.github.io/KDD2020/

import numpy as np
import autogluon as ag
import mxnet as mx
from mxnet.gluon import nn, Trainer
from mxnet.util import use_np


@use_np
class Net:
    def train_fn(self, args, reporter):
        gpu_ctx_l = [mx.gpu(i) for i in range(mx.context.num_gpus())]
        print('num_gpus:', len(gpu_ctx_l))
        np.random.seed(123)
        mx.random.seed(123)
        net = nn.HybridSequential()
        net.add(nn.Dense(16))
        net.add(nn.Activation('relu'))
        net.add(nn.Dense(4))
        net.hybridize()
        net.initialize(ctx=gpu_ctx_l)
        trainer = Trainer(net.collect_params(), 'adam')
        for i in range(100):
            with mx.autograd.record():
                data = mx.np.random.normal(0, 1, (8, 4), ctx=gpu_ctx_l[0])
                out = net(data)
                loss = mx.np.square(out - data).sum()
                loss.backward()
                reporter(loss=loss.asnumpy().item(), iteration=i)
            trainer.step(1.0)



def run_tuning_jobs(fn, search_space):
    args_decorator = ag.args(**search_space)
    scheduler = ag.scheduler.FIFOScheduler(
        args_decorator(fn),
        resource={'num_cpus': 4, 'num_gpus': 1},
        num_trials=20,
        reward_attr='loss',
        time_attr='iteration')
    scheduler.run()
    scheduler.join_jobs()
    return scheduler


search_space = {
    'num_hidden': ag.space.Int(16, 32),
    'lr': ag.space.Real(1e-3, 1e-2)
}

net = Net()

scheduler = run_tuning_jobs(net.train_fn, search_space)
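
Note that Example 1 runs because the parent process never touches the GPU before the scheduler forks its workers: all CUDA initialization happens inside train_fn, i.e. inside the child processes.
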
  • Example 2 (raises an error)
import numpy as np
import autogluon as ag
import mxnet as mx
from mxnet.gluon import nn, Trainer
from mxnet.util import use_np


@use_np
class Net:
    def train_fn(self, args, reporter):
        gpu_ctx_l = [mx.gpu(i) for i in range(mx.context.num_gpus())]
        print('num_gpus:', len(gpu_ctx_l))
        np.random.seed(123)
        mx.random.seed(123)
        net = nn.HybridSequential()
        net.add(nn.Dense(16))
        net.add(nn.Activation('relu'))
        net.add(nn.Dense(4))
        net.hybridize()
        net.initialize(ctx=gpu_ctx_l)
        trainer = Trainer(net.collect_params(), 'adam')
        for i in range(100):
            with mx.autograd.record():
                data = mx.np.random.normal(0, 1, (8, 4), ctx=gpu_ctx_l[0])
                out = net(data)
                loss = mx.np.square(out - data).sum()
                loss.backward()
                reporter(loss=loss.asnumpy().item(), iteration=i)
            trainer.step(1.0)



def run_tuning_jobs(fn, search_space):
    args_decorator = ag.args(**search_space)
    scheduler = ag.scheduler.FIFOScheduler(
        args_decorator(fn),
        resource={'num_cpus': 4, 'num_gpus': 1},
        num_trials=20,
        reward_attr='loss',
        time_attr='iteration')
    scheduler.run()
    scheduler.join_jobs()
    return scheduler


search_space = {
    'num_hidden': ag.space.Int(16, 32),
    'lr': ag.space.Real(1e-3, 1e-2)
}

net = Net()

# Add one line that initializes CUDA in the parent process before the scheduler forks its workers
a = mx.np.ones((10,), ctx=mx.gpu())
scheduler = run_tuning_jobs(net.train_fn, search_space)
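
If the parent process only needs to know how many GPUs exist, rather than an actual CUDA context, one workaround (a sketch, not something proposed in the issue, and assuming the nvidia-smi CLI is on the PATH) is to query nvidia-smi in a subprocess instead of creating an MXNet array on the GPU, so CUDA is never initialized in the parent and the forked scheduler workers stay usable:

import subprocess


def count_gpus_without_cuda_init():
    """Count visible GPUs without initializing CUDA in this process."""
    try:
        out = subprocess.check_output(
            ['nvidia-smi', '--query-gpu=uuid', '--format=csv,noheader'])
        return len(out.decode().strip().splitlines())
    except (OSError, subprocess.CalledProcessError):
        return 0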

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

1 reaction
jwmueller commented, Aug 1, 2020

Also, @sxjscience pointed out that the reason tasks like ImageClassification get around this issue is that their task.fit() returns models to the CPU instead of the GPU, which seems undesirable…

https://github.com/awslabs/autogluon/blob/5e1acab422289921ae9f7112e71855c2ea89e3b1/autogluon/task/image_classification/image_classification.py#L316
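
If the model does come back on the CPU, its parameters can be moved to the GPU explicitly afterwards. A minimal sketch, not from the issue, assuming the MXNet 1.x Gluon API and at least one visible GPU:

import mxnet as mx
from mxnet.gluon import nn

# Build and initialize a small network on the CPU (standing in for a model that
# task.fit() has reset to the CPU), then relocate its parameters to the GPU.
net = nn.Dense(4)
net.initialize(ctx=mx.cpu())
net(mx.nd.ones((1, 8)))                    # force deferred parameter allocation on the CPU
net.collect_params().reset_ctx(mx.gpu(0))  # move every parameter to GPU 0
print(net.weight.data().context)           # should now report gpu(0)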

1 reaction
sxjscience commented, Aug 1, 2020

For Ray/Tune, we are able to do this: @zhreshold @szha @jwmueller @Innixma @Jerryzcn

import numpy as np
import autogluon as ag
import ray
from ray import tune
import mxnet as mx
from mxnet.gluon import nn, Trainer
from mxnet.util import use_np

def get_mxnet_visible_gpus():
    """Get the number of GPUs that are visible to MXNet.

    Returns
    -------
    ctx_l
        The ctx list
    """
    import mxnet as mx
    gpu_count = 0
    while True:
        try:
            arr = mx.np.array(1.0, ctx=mx.gpu(gpu_count))
            arr.asnumpy()
            gpu_count += 1
        except Exception:
            break
    return [mx.gpu(i) for i in range(gpu_count)]


@use_np
class Net:
    def train_fn(self, args, reporter):
        np.random.seed(123)
        mx.random.seed(123)
        gpu_ctx_l = get_mxnet_visible_gpus()
        print(gpu_ctx_l)
        net = nn.HybridSequential()
        net.add(nn.Dense(args['num_hidden']))
        net.add(nn.Activation('relu'))
        net.add(nn.Dense(4))
        net.hybridize()
        net.initialize(ctx=gpu_ctx_l)
        trainer = Trainer(net.collect_params(), 'adam', {'learning_rate': args['lr']})
        for i in range(10):
            with mx.autograd.record():
                loss_l = []
                for ctx in gpu_ctx_l:
                    data = mx.np.random.normal(0, 1, (8, 4), ctx=ctx)
                    out = net(data)
                    loss = mx.np.square(out - data).sum()
                    loss_l.append(loss)
                for loss in loss_l:
                    loss.backward()
            sum_loss = sum([loss.asnumpy() for loss in loss_l])
            reporter(loss=-sum_loss, iteration=i)
            trainer.step(1.0)
        return net


search_space = {
    'num_hidden': tune.sample_from(lambda _: np.random.randint(16, 32)),
    'lr': tune.sample_from(lambda _: np.random.uniform(1e-3, 1e-2))
}

a = mx.np.ones((10,), ctx=mx.gpu())
net = Net()
analysis = tune.run(net.train_fn, config=search_space, num_samples=16, resources_per_trial={'gpu': 2})
print(analysis.dataframe())
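
This appears to sidestep the fork problem because Ray launches each trial in its own worker process, started by Ray's backend rather than forked from the driver, so the CUDA context created in the driver by a = mx.np.ones((10,), ctx=mx.gpu()) never leaks into the trials.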