Question: OneBitAdam on Eth/TCP
I’m interested in training BERT using multiple nodes with multiple GPUs (Titan-V). Our cluster is Kubernetes-based, and we don’t have InfiniBand interconnects, only 10Gb Ethernet.
Using the provided Dockerfile (with up-to-date DeepSpeed code), we’re unable to run it. The image is missing mpiname, and other functionality from openmpi and the mvapich launcher is missing as well.
Does OneBitAdam support such a setup? If so, could you please provide details on how to enable OneBitAdam on TCP/Ethernet-based networking?
Thanks
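For readers with the same question, here is a minimal sketch of what a DeepSpeed configuration selecting the 1-bit Adam optimizer might look like. The key names (`OneBitAdam`, `freeze_step`, `cuda_aware`, `comm_backend_name`) follow the DeepSpeed 1-bit Adam tutorial as I understand it; all numeric values and the output file name are placeholders, and `comm_backend_name` may not exist in older releases. Nothing in this thread confirms that this configuration works on a 10Gb Ethernet Kubernetes cluster, so treat it as a starting point to verify against the documentation for your installed version.

```python
import json

# Placeholder values throughout; only the key names are taken from the
# DeepSpeed 1-bit Adam tutorial. Check them against your installed version.
ds_config = {
    "train_batch_size": 4096,               # placeholder
    "train_micro_batch_size_per_gpu": 16,   # placeholder
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 4e-4,                     # placeholder
            "freeze_step": 23000,           # placeholder: warmup steps before compression starts
            # Assumption: False tells the MPI-based path not to rely on
            # CUDA-aware MPI, which the tutorial ties to InfiniBand systems.
            "cuda_aware": False,
            # Assumption: newer releases accept "nccl" or "mpi" here.
            "comm_backend_name": "nccl",
        },
    },
    "fp16": {"enabled": True},
}


def write_config(path="ds_config.json"):
    """Serialize the sketch config to a JSON file (hypothetical file name)."""
    with open(path, "w") as f:
        json.dump(ds_config, f, indent=2)
    return path
```

The resulting JSON file would then be passed to the training script through whatever DeepSpeed config argument that script already uses.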
Issue Analytics
- State:
- Created 3 years ago
- Comments:12 (6 by maintainers)
Top GitHub Comments
Hi, I’m sorry, but I haven’t had a chance. It’s definitely on my to-do list, and I’ll update once I get to run such a setup.
Hey @peteriz, were you able to try this out? I am setting up an academic lab and am not sure if 10G Ethernet interconnects are sufficient.
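On whether 10G Ethernet is sufficient, a rough back-of-envelope estimate can help frame the question. The sketch below assumes a ring all-reduce across nodes, in which each node sends about 2(n-1)/n times the gradient payload, and assumes the link runs at full line rate with no protocol overhead. The parameter count (roughly 340 million for BERT-Large), the node count, and the 5x compression factor are illustrative assumptions, not measurements from this thread, and the helper name is hypothetical.

```python
def ring_allreduce_seconds(num_params, bytes_per_value, num_nodes,
                           link_gbit_per_s, compression_factor=1.0):
    """Idealized time for one gradient all-reduce over the inter-node link.

    Treats each node as a single ring participant and ignores latency,
    protocol overhead, and overlap with computation.
    """
    payload_bytes = num_params * bytes_per_value / compression_factor
    # In a ring all-reduce each node sends about 2*(n-1)/n of the payload.
    traffic_bytes = 2 * (num_nodes - 1) / num_nodes * payload_bytes
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    return traffic_bytes / link_bytes_per_s


BERT_LARGE_PARAMS = 340_000_000  # approximate, assumed for illustration

# fp16 gradients (2 bytes each), 4 nodes, 10 Gbit/s Ethernet
dense_s = ring_allreduce_seconds(BERT_LARGE_PARAMS, 2, 4, 10)

# Same setup with an assumed 5x reduction in communication volume
compressed_s = ring_allreduce_seconds(BERT_LARGE_PARAMS, 2, 4, 10,
                                      compression_factor=5.0)

print(f"dense: {dense_s:.3f} s per step, compressed: {compressed_s:.3f} s per step")
```

Comparing these figures against the measured compute time per step on your GPUs gives a first indication of whether communication would dominate; real throughput on TCP will be lower than this idealized estimate.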