[question] Multi-GPU Error for Custom Optimizer
See original GitHub issue

Describe your question
I am porting a PyTorch codebase that uses a fastai-based optimizer (OptimWrapper over Adam). I see this error when moving from a single-GPU to a multi-GPU setting. Single-GPU training works fine because Horovod's DistributedOptimizer isn't used there. It seems that the hvd.DistributedOptimizer call reinitializes the optimizer, and the optimizer expects an additional parameter 'wd' that isn't passed. Is there a simple fix to enable passing extra arguments?
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 131, in <module>
sys.exit(main(args.train_entrypoint))
File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 112, in main
trial_inst = trial_class(trial_context)
File "/run/determined/workdir/tools/train_dai_optimizer.py", line 132, in __init__
self.optimizer = self.context.wrap_optimizer(self.optimizer)
File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_context.py", line 230, in wrap_optimizer
optimizer = hvd.DistributedOptimizer(
File "/opt/conda/lib/python3.8/site-packages/horovod/torch/optimizer.py", line 598, in DistributedOptimizer
return cls(optimizer.param_groups, named_parameters, compression, backward_passes_per_step, op,
File "/opt/conda/lib/python3.8/site-packages/horovod/torch/optimizer.py", line 43, in __init__
super(self.__class__, self).__init__(params)
TypeError: __init__() missing 1 required positional argument: 'wd'
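The traceback shows why extra constructor arguments break: Horovod re-instantiates the wrapped optimizer's class from `param_groups` alone (`cls(optimizer.param_groups, ...)` in the trace), so any `__init__` that requires additional positional arguments fails. A minimal sketch reproducing the failure in plain PyTorch, where `CustomAdam` is a hypothetical stand-in for the fastai-wrapped optimizer:

```python
import torch

# Hypothetical stand-in for an optimizer whose __init__ requires an
# extra positional argument, as the fastai-based optimizer does.
class CustomAdam(torch.optim.Adam):
    def __init__(self, params, wd):
        super().__init__(params, weight_decay=wd)

opt = CustomAdam([torch.nn.Parameter(torch.zeros(1))], wd=0.01)

# Horovod's DistributedOptimizer re-instantiates the optimizer's class
# from param_groups alone, so the required 'wd' is never supplied:
try:
    type(opt)(opt.param_groups)
except TypeError as e:
    print(e)  # missing 1 required positional argument: 'wd'
```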
Issue Analytics
- Created: a year ago
- Comments: 9 (5 by maintainers)
Top Results From Across the Web

- Keras/Tensorflow multi GPU InvalidArgumentError in optimizer
  Your issue seems to be similar to the one reported here. It appears that the input data size must be a multiple of...
- How to use multi-GPU · Issue #4591 · facebookresearch/fairseq
  RuntimeError: CUDA out of memory (OOM) happens in one GPU. So it is not a multi-GPU problem. Allocating memory is necessary because you...
- Efficient Training on Multiple GPUs - Hugging Face
  When training on a single GPU is too slow or the model weights don't fit in a single GPU's memory we use a...
- MPSGraph adamUpdateWithLearningRa… - Apple Developer
  So, I have this same problem. It is the Adam optimizer. SGD runs fine using the mnist example code posted here. I first...
- tf.keras.optimizers.Optimizer | TensorFlow v2.11.0
  Function to update variable value based on given gradients. This method must be implemented in customized optimizers. Args: gradient, backpropagated gradient...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I am trying to port an OptimWrapper-based optimizer to DAI. I observed that performance drops when switching from a fastai optimizer to a PyTorch one. I do agree that Horovod and fastai don't seem to be compatible; more specifically, Horovod doesn't support custom optimizers yet. I can't really use torch distributed launch unless the Determined version on the server is upgraded. The Core API looks promising; I will give it a try.

Thank you for looking into the issue.
The error might be different, but the root cause is the same. The fastai Optimizer class expects a second argument, cbs, here. I am using the OptimWrapper class provided by fastai, which expects an argument wd. My understanding is that wrap_optimizer reinitializes the optimizer without any way to pass these extra arguments; it seems to be a Horovod restriction. Let me share the code snippets and version information with you by tomorrow.
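One possible workaround, sketched under the assumption that the extra argument can carry a default: give `wd` a default value so that re-instantiation from `param_groups` succeeds. The real value survives the re-init because it is already stored in each param group, which PyTorch's `add_param_group` only fills in when a key is missing. `CustomAdam` is again a hypothetical stand-in, not the fastai class itself:

```python
import torch

# Hypothetical subclass: giving 'wd' a default lets the class be
# re-instantiated from param_groups alone, which is what
# hvd.DistributedOptimizer does internally.
class CustomAdam(torch.optim.Adam):
    def __init__(self, params, wd=0.0):
        super().__init__(params, weight_decay=wd)

opt = CustomAdam([torch.nn.Parameter(torch.zeros(1))], wd=0.01)

# Mimic Horovod's re-init; per-group hyperparameters already present
# in param_groups are kept, so the original weight decay is not lost.
reinit = type(opt)(opt.param_groups)
print(reinit.param_groups[0]["weight_decay"])  # 0.01
```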