DDLRUN + DeepSpeed on SUMMIT
Hi,
I am trying to use DeepSpeed on Summit with ddlrun, but it doesn’t work properly.
I am testing it with the CIFAR-10 example, launched like this:
ddlrun deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
Could you please provide an example of using DeepSpeed with Horovod, MPI, and ddlrun?
Issue Analytics
- State:
- Created 4 years ago
- Comments: 7 (2 by maintainers)
Top GitHub Comments
Thanks @jeffra for the update. I will test it and I will give you my feedback.
Oh, that was actually for using Megatron-LM code, which doesn’t use DeepSpeed distributed code.
I will test it again with the cifar test.
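For anyone hitting the same problem, here is a minimal sketch of one possible workaround. It is an assumption, not something confirmed in this thread. The idea is that ddlrun starts one process per GPU through MPI, so the training script could be started directly with python instead of through the deepspeed launcher, and the script could translate the per-rank variables exported by MPI into the RANK, WORLD_SIZE, and LOCAL_RANK variables that PyTorch's env:// initialization reads. The OMPI_COMM_WORLD_* names are the ones Open MPI exports; whether the MPI used by ddlrun on Summit sets the same names should be checked on the system. The helper name torch_env_from_mpi and the default address and port are hypothetical.

```python
import os

# Per-rank variables exported by Open MPI, mapped to the names that
# PyTorch's env:// rendezvous expects. Assumption: the MPI that ddlrun
# uses on Summit exports the same OMPI_* variables.
_MPI_TO_TORCH = {
    "OMPI_COMM_WORLD_RANK": "RANK",
    "OMPI_COMM_WORLD_SIZE": "WORLD_SIZE",
    "OMPI_COMM_WORLD_LOCAL_RANK": "LOCAL_RANK",
}


def torch_env_from_mpi(env, master_addr="127.0.0.1", master_port="29500"):
    """Return the torch.distributed variables derived from an MPI environment.

    `env` is any mapping of environment variables (for example os.environ).
    The default master address and port are placeholders; on a multi-node
    job MASTER_ADDR must be the hostname of the node that runs rank 0.
    """
    missing = [name for name in _MPI_TO_TORCH if name not in env]
    if missing:
        raise RuntimeError("not launched under MPI, missing: " + ", ".join(missing))

    result = {torch_name: env[mpi_name] for mpi_name, torch_name in _MPI_TO_TORCH.items()}
    # Keep an address or port that the job script has already exported.
    result["MASTER_ADDR"] = env.get("MASTER_ADDR", master_addr)
    result["MASTER_PORT"] = env.get("MASTER_PORT", master_port)
    return result


# Intended use at the top of the training script, before any distributed
# initialization (left commented out so importing this sketch has no side effects):
#
#   os.environ.update(torch_env_from_mpi(os.environ))
```

With this in place, the job would be started with ddlrun and python rather than ddlrun and deepspeed, passing the same script arguments. This launch line is also an assumption and was not tested in the thread.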