
🤔 [question] Multi-GPU Error for Custom Optimizer

See original GitHub issue

Describe your question

I am porting PyTorch code that uses a fastai-based optimizer (OptimWrapper over Adam). I see this error when moving from a single-GPU to a multi-GPU setting. Single-GPU training works fine since Horovod's DistributedOptimizer isn't used. It seems that on the hvd.DistributedOptimizer call the optimizer is reinitialized, and the optimizer expects an additional parameter 'wd' that isn't passed. Is there a simple fix to enable passing extra arguments?

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 131, in <module>
    sys.exit(main(args.train_entrypoint))
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 112, in main
    trial_inst = trial_class(trial_context)
  File "/run/determined/workdir/tools/train_dai_optimizer.py", line 132, in __init__
    self.optimizer = self.context.wrap_optimizer(self.optimizer)
  File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_context.py", line 230, in wrap_optimizer
    optimizer = hvd.DistributedOptimizer(
  File "/opt/conda/lib/python3.8/site-packages/horovod/torch/optimizer.py", line 598, in DistributedOptimizer
    return cls(optimizer.param_groups, named_parameters, compression, backward_passes_per_step, op,
  File "/opt/conda/lib/python3.8/site-packages/horovod/torch/optimizer.py", line 43, in __init__
    super(self.__class__, self).__init__(params)
TypeError: __init__() missing 1 required positional argument: 'wd'
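
Reading the traceback, Horovod's DistributedOptimizer derives a new class from the wrapped optimizer's class and re-creates it from optimizer.param_groups alone, so an extra required constructor argument such as 'wd' is never forwarded. A minimal sketch of that failure mode, using a hypothetical WdOptimizer in place of the fastai wrapper:

import torch

# Hypothetical stand-in for an optimizer whose constructor requires an extra
# 'wd' argument, similar in spirit to the OptimWrapper described above.
class WdOptimizer(torch.optim.Adam):
    def __init__(self, params, wd, lr=1e-3):
        super().__init__(params, lr=lr, weight_decay=wd)

model = torch.nn.Linear(4, 2)
opt = WdOptimizer(model.parameters(), wd=1e-2)

# Roughly what the Horovod wrapper does: derive a class from the original
# optimizer's class and rebuild it from param_groups, without the extra argument.
rebuilt_cls = type(type(opt).__name__, (type(opt),), {})
try:
    rebuilt_cls(opt.param_groups)
except TypeError as exc:
    print(exc)  # __init__() missing 1 required positional argument: 'wd'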

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
abhinavagarwalla commented, Oct 13, 2022

I am trying to port an OptimWrapper-based optimizer to DAI. I observed that performance drops when switching from a fastai optimizer to a PyTorch one. I do agree that Horovod and fastai don't seem to be compatible; more specifically, Horovod doesn't support custom optimizers yet.

I can't really use torch distributed launch unless the Determined version on the server is upgraded. The Core API looks promising; I will give it a try.

Thank you for looking into the issue.
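
As a reference for the switch mentioned above, a hedged sketch of what replacing the OptimWrapper with a plain torch.optim.Adam can look like in a Determined PyTorchTrial; the class name and tensor shapes are placeholders, and the other required Trial methods (data loaders, train_batch, evaluate_batch) are omitted:

import torch
from determined.pytorch import PyTorchTrial, PyTorchTrialContext

class PortedTrial(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext) -> None:
        self.context = context
        self.model = self.context.wrap_model(torch.nn.Linear(4, 2))
        # A plain Adam takes weight decay as a keyword with a default, so the
        # Horovod wrapper can rebuild it from param_groups without extra arguments.
        optimizer = torch.optim.Adam(
            self.model.parameters(), lr=1e-3, weight_decay=1e-2
        )
        self.optimizer = self.context.wrap_optimizer(optimizer)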

1 reaction
abhinavagarwalla commented, Oct 4, 2022

The error might be different but the root cause is the same. The fastai Optimizer class expects a second argument, cbs, here. I am using the OptimWrapper class provided by fastai, which expects an argument wd. My understanding is that wrap_optimizer reinitializes the optimizer without any way to pass these arguments; it seems to be a Horovod restriction. Let me share the code snippets and version information with you by tomorrow.
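
For what it's worth, one generic way around the "no way to pass these arguments" restriction is to pre-bind the extra argument in a thin subclass, so the optimizer class can be rebuilt from param_groups alone. The sketch below uses a hypothetical torch-style CustomOptimizer; whether the same trick applies cleanly to fastai's OptimWrapper is unclear, given the compatibility limits noted above:

import torch

# Hypothetical optimizer whose constructor requires 'wd', standing in for the
# wrapper discussed in this issue.
class CustomOptimizer(torch.optim.Adam):
    def __init__(self, params, wd):
        super().__init__(params, weight_decay=wd)

def bind_wd(wd):
    # Return a subclass whose __init__ needs only the parameters, so a
    # param_groups-only re-initialization (as Horovod performs) succeeds.
    class BoundOptimizer(CustomOptimizer):
        def __init__(self, params):
            super().__init__(params, wd=wd)
    return BoundOptimizer

model = torch.nn.Linear(4, 2)
opt = bind_wd(1e-2)(model.parameters())
rebuilt = type(type(opt).__name__, (type(opt),), {})(opt.param_groups)
print(type(rebuilt).__name__)  # BoundOptimizer, rebuilt without extra arguments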


Top Results From Across the Web

Keras/Tensorflow multi GPU InvalidArgumentError in optimizer
Your issue seems to be similar to the one reported here. It appears that the input data size must be a multiple of...

How to use multi-GPU · Issue #4591 · facebookresearch/fairseq
RuntimeError: CUDA out of memory (OOM) happens in one gpu. So it is not a multi-gpu problem. Allocating memory is necessary because you...

Efficient Training on Multiple GPUs - Hugging Face
When training on a single GPU is too slow or the model weights don't fit in a single GPU's memory we use a...

MPSGraph adamUpdateWithLearningRa… - Apple Developer
So, I have this same problem. It is the adam optimizer. SGD runs fine using the mnist example code posted here. I first...

tf.keras.optimizers.Optimizer | TensorFlow v2.11.0
Function to update variable value based on given gradients. This method must be implemented in customized optimizers. Args. gradient, backpropagated gradient...
