Reintroduce AzureMLCluster
Feature request: `AzureMLCluster`
Background: long ago, `AzureMLCluster` was added as the second "cloud provider" in this repository. There were some problems, described in #206 and other issues, and the implementation was deprecated after the introduction of `AzureVMCluster`.

Since then, `azureml-core` has stabilized and now contains everything needed to implement `AzureMLCluster`.
It is trivial to use `dask-mpi` to start up the cluster on Azure ML, as can be seen in the XGBoost tutorial: https://github.com/Azure/azureml-examples/blob/main/tutorials/using-xgboost/src/run.py
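The core of that pattern is small; roughly (a sketch, assuming the job is launched under MPI so that `dask-mpi` can do the bootstrapping):

```python
# Sketch: bootstrap Dask from within an MPI-launched Azure ML job.
# Assumes dask[distributed], dask-mpi and mpi4py are in the environment.
from dask_mpi import initialize
from distributed import Client

initialize()       # rank 0 runs the scheduler, rank 1 keeps running this
                   # script, and the remaining ranks become workers
client = Client()  # connects to the scheduler started by initialize()
```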
Implementation
Some open questions:

- implementation: `dask-mpi`?
  - do we need to specify # of CPUs, etc.?
  - does this work with GPUs?
  - can this scale up/down?
- implementation: custom start of `dask.Scheduler` and `dask.Worker`s based on IP
  - prototype here: https://github.com/Azure/azureml-v2-preview/blob/main/examples/dask/src/startDask.py
  - uses the PyTorch backend to assign/get IPs, then starts up the processes (rough sketch below)
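For reference, a rough sketch of what that prototype does (the ports, rendezvous details, and helper name are assumptions; see the linked startDask.py for the real implementation):

```python
# Sketch: start a Dask scheduler on rank 0 and workers on the other ranks,
# using torch.distributed only to agree on the scheduler's IP address.
import socket
import subprocess

import torch.distributed as dist


def start_dask_from_ranks():
    # Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set by the launcher.
    dist.init_process_group(backend="gloo", init_method="env://")
    rank = dist.get_rank()

    # Rank 0 advertises its IP; every other rank receives it.
    scheduler_ip = [socket.gethostbyname(socket.gethostname())]
    dist.broadcast_object_list(scheduler_ip, src=0)
    scheduler_address = f"tcp://{scheduler_ip[0]}:8786"

    if rank == 0:
        subprocess.Popen(["dask-scheduler", "--port", "8786"])
    else:
        subprocess.Popen(["dask-worker", scheduler_address])

    return scheduler_address
```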
Outline of work needed
This part is less clear to me, now that I’m looking at it - I don’t want to further delay opening this issue. But some things we should ensure are:
- only depends on `azureml-core`
- this is properly scoped and in line with other cloud providers, i.e. not starting up Jupyter
- scaling up/down works (old implementation had issues; concerns with the dask-mpi proposal)
- this is tested
  - tests in this repo
  - we can also test large workloads/GPUs in https://github.com/Azure/azureml-examples
- works with `azureml.core.Environment` (a minimal sketch follows this list)
- need an existing AML compute target, or create one on behalf of the user?
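On the `azureml.core.Environment` point, a minimal sketch of what a cluster manager might construct on the user's behalf (the environment name and package list are assumptions):

```python
# Sketch: build and register an AML environment carrying the Dask dependencies.
from azureml.core import Environment, Workspace
from azureml.core.conda_dependencies import CondaDependencies

ws = Workspace.from_config()  # assumes a local workspace config.json

env = Environment(name="dask-cloudprovider")
env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["dask[distributed]", "dask-mpi", "mpi4py"]
)
env.register(workspace=ws)
```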
Additional details
I cannot personally contribute or maintain this work. A few people at Microsoft have indicated a willingness to do so - I am opening this issue publicly, after which I will start the discussion internally.
This would not be an “officially supported” part of the Azure Machine Learning service - it would be an open source contribution, provided “as-is” and without official SLAs or support, to the Dask community.
- Cody
Top GitHub Comments
As I remember it, I was one of a few proponent voices for the original `AzureMLCluster`, so I'd very much be in support of a reimplementation of the functionality! Particularly if the new implementation is able to be more independent of the wider stack used to implement AzureML.

What I particularly liked about `AzureMLCluster` was that it was all self-contained within AzureML. This seemed to be a neat extension of the concept behind AzureML, in that researchers could be given access to AzureML and no other part of the Azure estate, and still have everything they need for the research they're engaged in. As @jacobtomlinson says, the `AzureVMCluster` breaks this paradigm somewhat.

I reckon I can justify some time contributing to developing a new version of `AzureMLCluster` - I think there will be value in this cluster manager to at least the Informatics Lab. Of course, ongoing maintenance may be a bit harder to fit in 😑 @jacobtomlinson / @lostmygithubaccount happy to discuss further!

@jacobtomlinson So sorry, just getting to this now. You can use MPI in AzureML using a commandJob and specifying the `distribution.type` as MPI (docs here). You need the `az` CLI and the `az ml` extension installed (instructions here). The first thing you have to do is create an environment with `dask[distributed]`, `dask_mpi`, and `mpi4py`. You can use a conda file to create an AML environment and register it. MWE:
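Something along these lines (the file name, environment name, and base image are assumptions):

```yaml
# conda.yml - conda specification for the AML environment (illustrative)
name: dask-mpi
channels:
  - conda-forge
dependencies:
  - python=3.8
  - mpi4py
  - pip
  - pip:
      - dask[distributed]
      - dask-mpi
```

registered with something like (flag names assume the v2 `az ml` extension):

```
az ml environment create --name dask-mpi-env \
  --image mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04 \
  --conda-file conda.yml
```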
Then you can use it in a commandJob:
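For example (a sketch; the exact schema, compute target, environment reference, and instance counts depend on your setup and CLI version):

```yaml
# job.yml - commandJob running the training script under MPI (illustrative)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python run.py --mpi
code: src
environment: azureml:dask-mpi-env@latest
compute: azureml:cpu-cluster
resources:
  instance_count: 4
distribution:
  type: mpi
  process_count_per_instance: 1
```

submitted with `az ml job create --file job.yml`.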
I then have this snippet:
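Presumably something like the standard `dask-mpi` pattern (only the `--mpi` flag is taken from the description below; the rest is illustrative):

```python
# Sketch: use dask-mpi when launched by Azure ML, a local client otherwise.
import argparse

from distributed import Client

parser = argparse.ArgumentParser()
parser.add_argument("--mpi", action="store_true")
args, _ = parser.parse_known_args()

if args.mpi:
    # Running under mpirun on AML: let dask-mpi spin up the cluster.
    from dask_mpi import initialize

    initialize()
    client = Client()
else:
    # Running locally: fall back to an in-process cluster.
    client = Client(processes=False)
```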
I just add an `--mpi` flag to my argparser that I include in the job YAML for AzureML as shown above and omit when running locally, and then initialize the client using that snippet. It's really nice because my code stays the exact same running locally and remotely.

Hope this is helpful, let me know if anything else is needed.