
🤔 [question] Multi-GPU Error for Custom Optimizer

See original GitHub issue

Describe your question

I am porting PyTorch code that uses a fastai-based optimizer (OptimWrapper over Adam). I see this error when moving from a single-GPU to a multi-GPU setting. Single-GPU training works fine since Horovod's DistributedOptimizer isn't used. It seems that on the hvd.DistributedOptimizer call the optimizer is reinitialized, and the optimizer expects an additional parameter 'wd' that isn't passed. Is there a simple fix to enable passing extra arguments?

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 131, in <module>
    sys.exit(main(args.train_entrypoint))
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 112, in main
    trial_inst = trial_class(trial_context)
  File "/run/determined/workdir/tools/train_dai_optimizer.py", line 132, in __init__
    self.optimizer = self.context.wrap_optimizer(self.optimizer)
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_context.py", line 230, in wrap_optimizer
    optimizer = hvd.DistributedOptimizer(
  File "/opt/conda/lib/python3.8/site-packages/horovod/torch/optimizer.py", line 598, in DistributedOptimizer
    return cls(optimizer.param_groups, named_parameters, compression, backward_passes_per_step, op,
  File "/opt/conda/lib/python3.8/site-packages/horovod/torch/optimizer.py", line 43, in __init__
    super(self.__class__, self).__init__(params)
TypeError: __init__() missing 1 required positional argument: 'wd'
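
Reading the traceback, Horovod's DistributedOptimizer derives a new class from the wrapped optimizer's class and re-creates it from optimizer.param_groups alone, so an extra required constructor argument such as 'wd' is never forwarded. A minimal sketch of that failure mode, using a hypothetical WdOptimizer in place of the fastai wrapper:

import torch

# Hypothetical stand-in for an optimizer whose constructor requires an extra
# 'wd' argument, similar in spirit to the OptimWrapper described above.
class WdOptimizer(torch.optim.Adam):
    def __init__(self, params, wd, lr=1e-3):
        super().__init__(params, lr=lr, weight_decay=wd)

model = torch.nn.Linear(4, 2)
opt = WdOptimizer(model.parameters(), wd=1e-2)

# Roughly what the Horovod wrapper does: derive a class from the original
# optimizer's class and rebuild it from param_groups, without the extra argument.
rebuilt_cls = type(type(opt).__name__, (type(opt),), {})
try:
    rebuilt_cls(opt.param_groups)
except TypeError as exc:
    print(exc)  # __init__() missing 1 required positional argument: 'wd'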

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
abhinavagarwalla commented, Oct 13, 2022

I am trying to port an OptimWrapper-based optimizer to DAI. I observed that performance drops when switching from a fastai optimizer to a PyTorch one. I do agree that Horovod and fastai don't seem to be compatible; more specifically, Horovod doesn't support custom optimizers yet.

I can't really use torch distributed launch unless the Determined version on the server is upgraded. The Core API looks promising; I will give it a try.

Thank you for looking into the issue.
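
As a reference for the switch mentioned above, a hedged sketch of what replacing the OptimWrapper with a plain torch.optim.Adam can look like in a Determined PyTorchTrial; the class name and tensor shapes are placeholders, and the other required Trial methods (data loaders, train_batch, evaluate_batch) are omitted:

import torch
from determined.pytorch import PyTorchTrial, PyTorchTrialContext

class PortedTrial(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext) -> None:
        self.context = context
        self.model = self.context.wrap_model(torch.nn.Linear(4, 2))
        # A plain Adam takes weight decay as a keyword with a default, so the
        # Horovod wrapper can rebuild it from param_groups without extra arguments.
        optimizer = torch.optim.Adam(
            self.model.parameters(), lr=1e-3, weight_decay=1e-2
        )
        self.optimizer = self.context.wrap_optimizer(optimizer)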

1 reaction
abhinavagarwalla commented, Oct 4, 2022

The error might be different but the root cause is the same. The fastai Optimizer class expects a second argument, cbs, here. I am using the OptimWrapper class provided by fastai, which expects an argument wd. My understanding is that wrap_optimizer reinitializes the optimizer without any way to pass these arguments; it seems to be a Horovod restriction. Let me share the code snippets and version information with you by tomorrow.
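
For what it's worth, one generic way around the "no way to pass these arguments" restriction is to pre-bind the extra argument in a thin subclass, so the optimizer class can be rebuilt from param_groups alone. The sketch below uses a hypothetical torch-style CustomOptimizer; whether the same trick applies cleanly to fastai's OptimWrapper is unclear, given the compatibility limits noted above:

import torch

# Hypothetical optimizer whose constructor requires 'wd', standing in for the
# wrapper discussed in this issue.
class CustomOptimizer(torch.optim.Adam):
    def __init__(self, params, wd):
        super().__init__(params, weight_decay=wd)

def bind_wd(wd):
    # Return a subclass whose __init__ needs only the parameters, so a
    # param_groups-only re-initialization (as Horovod performs) succeeds.
    class BoundOptimizer(CustomOptimizer):
        def __init__(self, params):
            super().__init__(params, wd=wd)
    return BoundOptimizer

model = torch.nn.Linear(4, 2)
opt = bind_wd(1e-2)(model.parameters())
rebuilt = type(type(opt).__name__, (type(opt),), {})(opt.param_groups)
print(type(rebuilt).__name__)  # BoundOptimizer, rebuilt without extra arguments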


Top Results From Across the Web

Keras/Tensorflow multi GPU InvalidArgumentError in optimizer
Your issue seems to be similar to the one reported here. It appears that the input data size must be a multiple of...

How to use multi-GPU · Issue #4591 · facebookresearch/fairseq
RuntimeError: CUDA out of memory (OOM) happens in one gpu. So it is not a multi-gpu problem. Allocating memory is necessary because you...

Efficient Training on Multiple GPUs - Hugging Face
When training on a single GPU is too slow or the model weights don't fit in a single GPU's memory we use a...

MPSGraph adamUpdateWithLearningRa… - Apple Developer
So, I have this same problem. It is the adam optimizer. SGD runs fine using the mnist example code posted here. I first...

tf.keras.optimizers.Optimizer | TensorFlow v2.11.0
Function to update variable value based on given gradients. This method must be implemented in customized optimizers. Args. gradient, backpropagated gradient...
