Reintroduce AzureMLCluster
Feature request: `AzureMLCluster`
Background: long ago, `AzureMLCluster` was added as the second "cloud provider" in this repository. There were some problems, described in #206 and other issues, and the implementation was deprecated after the introduction of `AzureVMCluster`.

Since then, `azureml-core` has stabilized and now contains everything needed to implement `AzureMLCluster`.
It is trivial to use `dask-mpi` to start up the cluster on Azure ML, as can be seen in the XGBoost tutorial: https://github.com/Azure/azureml-examples/blob/main/tutorials/using-xgboost/src/run.py
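The core of that pattern is small; roughly (a sketch, assuming the job is launched under MPI so that `dask-mpi` can do the bootstrapping):

```python
# Sketch: bootstrap Dask from within an MPI-launched Azure ML job.
# Assumes dask[distributed], dask-mpi and mpi4py are in the environment.
from dask_mpi import initialize
from distributed import Client

initialize()       # rank 0 runs the scheduler, rank 1 keeps running this
                   # script, and the remaining ranks become workers
client = Client()  # connects to the scheduler started by initialize()
```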
Implementation
Some open questions:

- implementation: `dask-mpi`?
  - do we need to specify # of CPUs, etc.?
  - does this work with GPUs?
  - can this scale up/down?
- implementation: custom start of `dask.Scheduler` and `dask.Worker`s based on IP
  - prototype here: https://github.com/Azure/azureml-v2-preview/blob/main/examples/dask/src/startDask.py
  - uses the PyTorch backend to assign/get IPs, then starts up the processes (rough sketch below)
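For reference, a rough sketch of what that prototype does (the ports, rendezvous details, and helper name are assumptions; see the linked startDask.py for the real implementation):

```python
# Sketch: start a Dask scheduler on rank 0 and workers on the other ranks,
# using torch.distributed only to agree on the scheduler's IP address.
import socket
import subprocess

import torch.distributed as dist


def start_dask_from_ranks():
    # Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set by the launcher.
    dist.init_process_group(backend="gloo", init_method="env://")
    rank = dist.get_rank()

    # Rank 0 advertises its IP; every other rank receives it.
    scheduler_ip = [socket.gethostbyname(socket.gethostname())]
    dist.broadcast_object_list(scheduler_ip, src=0)
    scheduler_address = f"tcp://{scheduler_ip[0]}:8786"

    if rank == 0:
        subprocess.Popen(["dask-scheduler", "--port", "8786"])
    else:
        subprocess.Popen(["dask-worker", scheduler_address])

    return scheduler_address
```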
Outline of work needed
This part is less clear to me, now that I’m looking at it - I don’t want to further delay opening this issue. But some things we should ensure are:
- only depends on `azureml-core`
- this is properly scoped and in line with other cloud providers, i.e. not starting up Jupyter
- scaling up/down works (old implementation had issues; concerns with the dask-mpi proposal)
- this is tested
  - tests in this repo
  - we can also test large workloads/GPUs in https://github.com/Azure/azureml-examples
- works with `azureml.core.Environment` (a minimal sketch follows this list)
- need an existing AML compute target, or create one on behalf of the user?
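On the `azureml.core.Environment` point, a minimal sketch of what a cluster manager might construct on the user's behalf (the environment name and package list are assumptions):

```python
# Sketch: build and register an AML environment carrying the Dask dependencies.
from azureml.core import Environment, Workspace
from azureml.core.conda_dependencies import CondaDependencies

ws = Workspace.from_config()  # assumes a local workspace config.json

env = Environment(name="dask-cloudprovider")
env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["dask[distributed]", "dask-mpi", "mpi4py"]
)
env.register(workspace=ws)
```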
Additional details
I cannot personally contribute or maintain this work. A few people at Microsoft have indicated a willingness to do so - I am opening this issue publicly, after which I will start the discussion internally.
This would not be an “officially supported” part of the Azure Machine Learning service - it would be an open source contribution, provided “as-is” and without official SLAs or support, to the Dask community.
- Cody
Top GitHub Comments
As I remember it, I was one of a few proponent voices for the original `AzureMLCluster`, so I'd very much be in support of a reimplementation of the functionality! Particularly if the new implementation is able to be more independent of the wider stack used to implement AzureML.

What I particularly liked about `AzureMLCluster` was that it was all self-contained within AzureML. This seemed to be a neat extension of the concept behind AzureML, in that researchers could be given access to AzureML and no other part of the Azure estate, and still have everything they need for the research they're engaged in. As @jacobtomlinson says, the `AzureVMCluster` breaks this paradigm somewhat.

I reckon I can justify some time contributing to developing a new version of `AzureMLCluster` - I think there will be value in this cluster manager to at least the Informatics Lab. Of course, ongoing maintenance may be a bit harder to fit in 😑 @jacobtomlinson / @lostmygithubaccount happy to discuss further!

@jacobtomlinson So sorry, just getting to this now. You can use MPI in AzureML using a commandJob and specifying the `distribution.type` as MPI (docs here). You need the `az` CLI and the `az ml` extension installed (instructions here). The first thing you have to do is create an environment with `dask[distributed]`, `dask_mpi`, and `mpi4py`. You can use a conda file to create an AML environment and register it. MWE:
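Something along these lines (the file name, environment name, and base image are assumptions):

```yaml
# conda.yml - conda specification for the AML environment (illustrative)
name: dask-mpi
channels:
  - conda-forge
dependencies:
  - python=3.8
  - mpi4py
  - pip
  - pip:
      - dask[distributed]
      - dask-mpi
```

registered with something like (flag names assume the v2 `az ml` extension):

```
az ml environment create --name dask-mpi-env \
  --image mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04 \
  --conda-file conda.yml
```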
Then you can use it in a commandJob:
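For example (a sketch; the exact schema, compute target, environment reference, and instance counts depend on your setup and CLI version):

```yaml
# job.yml - commandJob running the training script under MPI (illustrative)
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python run.py --mpi
code: src
environment: azureml:dask-mpi-env@latest
compute: azureml:cpu-cluster
resources:
  instance_count: 4
distribution:
  type: mpi
  process_count_per_instance: 1
```

submitted with `az ml job create --file job.yml`.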
I then have this snippet:
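Presumably something like the standard `dask-mpi` pattern (only the `--mpi` flag is taken from the description below; the rest is illustrative):

```python
# Sketch: use dask-mpi when launched by Azure ML, a local client otherwise.
import argparse

from distributed import Client

parser = argparse.ArgumentParser()
parser.add_argument("--mpi", action="store_true")
args, _ = parser.parse_known_args()

if args.mpi:
    # Running under mpirun on AML: let dask-mpi spin up the cluster.
    from dask_mpi import initialize

    initialize()
    client = Client()
else:
    # Running locally: fall back to an in-process cluster.
    client = Client(processes=False)
```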
I just add an `--mpi` flag to my argparser that I include in the job YAML for AzureML as shown above and omit when running locally, and then initialize the client using that snippet. It's really nice because my code stays the exact same running locally and remotely.

Hope this is helpful, let me know if anything else is needed.