[Feature Request] Direct SageMaker support?

See original GitHub issue

What is the problem this feature will solve?

A lot of individuals and companies use SageMaker for model training and deployment, but they are often not experts in wrapping repositories like this one for SageMaker. Instead, they tend to default to examples they can find that are already integrated with SageMaker. However, in the object detection space, these examples are often much less capable than MMDetection.

What is the feature you are proposing to solve the problem?

Creating a tools/train_sagemaker.py script and an example for training.

What alternatives have you considered?

Right now I have a train_sagemaker.py script that launches training by executing subprocess.Popen with a command that uses torchrun to launch tools/train.py. For example:

    import os
    import subprocess

    # `args`, `world`, and `config_file` are defined earlier in the script:
    # parsed CLI arguments, the cluster topology, and the MMDetection config path.

    # torchrun arguments describing the distributed topology
    launch_config = ["torchrun",
                     "--nnodes", str(world['number_of_machines']),
                     "--node_rank", str(world['machine_rank']),
                     "--nproc_per_node", str(world['number_of_processes']),
                     "--master_addr", world['master_addr'],
                     "--master_port", world['master_port']]

    # MMDetection training arguments; checkpoints are written to the SageMaker
    # checkpoint directory so they get synced to S3
    train_config = [os.path.join(os.environ["MMDETECTION"], "tools/train.py"),
                    config_file,
                    "--launcher", "pytorch",
                    "--work-dir", "/opt/ml/checkpoints"]

    if not args.validate:
        train_config.append("--no-validate")

    # Concatenate the torchrun launch config and the MMDetection train config
    joint_cmd = " ".join(str(x) for x in launch_config + train_config)
    print("The following command will be executed:\n", joint_cmd)

    process = subprocess.Popen(joint_cmd, stderr=subprocess.STDOUT,
                               stdout=subprocess.PIPE, shell=True)

    # Stream training output line by line until the process exits
    while True:
        output = process.stdout.readline()
        if process.poll() is not None:
            break
        if output:
            print(output.decode("utf-8").strip())

    rc = process.poll()
    if rc != 0:
        raise subprocess.CalledProcessError(returncode=rc, cmd=joint_cmd)
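
The `world` dict above isn't defined in the snippet; inside a SageMaker training container it would typically be derived from the environment variables SageMaker injects. Here is a minimal sketch of how it could be built (the variable names match the script above, but the exact construction is an assumption, not part of the original issue):

    import json
    import os

    # Hypothetical reconstruction of the `world` dict used above, built from
    # the environment variables SageMaker sets inside a training container.
    hosts = json.loads(os.environ["SM_HOSTS"])    # e.g. ["algo-1", "algo-2"]
    current_host = os.environ["SM_CURRENT_HOST"]  # e.g. "algo-1"

    world = {
        "number_of_machines": len(hosts),
        "machine_rank": hosts.index(current_host),
        "number_of_processes": int(os.environ["SM_NUM_GPUS"]),  # GPUs per node
        "master_addr": hosts[0],  # convention: first host acts as master
        "master_port": "7777",    # any free port; kept as a string above
    }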

But maybe there’s a better way to accomplish this and integrate it more directly?
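
For context, a script like this is typically launched through the SageMaker Python SDK's PyTorch estimator. Below is a minimal sketch of what that could look like; the role ARN, S3 path, instance settings, and source layout are all placeholder assumptions:

    from sagemaker.pytorch import PyTorch

    # Hedged sketch: run the proposed train_sagemaker.py as a SageMaker
    # training job. Every value below is a placeholder.
    estimator = PyTorch(
        entry_point="train_sagemaker.py",  # the script discussed above
        source_dir="tools",                # assumed location of the script
        role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
        instance_count=2,                  # two nodes -> multi-machine torchrun
        instance_type="ml.p3.8xlarge",     # example multi-GPU instance type
        framework_version="1.12",          # a PyTorch version that ships torchrun
        py_version="py38",
        hyperparameters={"validate": 1},   # forwarded to the script's argparse
    )

    # Each channel appears under /opt/ml/input/data/<channel> in the container.
    estimator.fit({"training": "s3://my-bucket/coco/"})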

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
austinmw commented, Oct 23, 2022

No problem, thanks for working on this, let me know if you need any help!

0 reactions
ZwwWayne commented, Oct 23, 2022

Hi @austinmw, thanks for sharing. The code is definitely helpful to us. We will review it closely and may have a design in the following month. It might take several weeks, as we do not have AWS services for now and already have plans for this month and the next.

Read more comments on GitHub >

Top Results From Across the Web

Create, Store, and Share Features with Amazon SageMaker ...
The offline store can help you store and serve features for exploration and model training. The online store retains only the latest feature...
Read more >
Ingesting Historical Feature Data into SageMaker Feature Store
In this blog post I show how to write historical feature data directly into S3, which is the backbone of the SMFS offline...
Read more >
Amazon SageMaker Feature Store Deep Dive Demo - YouTube
In this demo video, you'll learn how Amazon SageMaker Feature Store helps to store, update, retrieve, and share machine learning (ML) ...
Read more >
aws/amazon-sagemaker-examples - GitHub
GitHub - aws/amazon-sagemaker-examples: Example Jupyter notebooks that ... to directly deploy the best model to an endpoint to serve inference requests.
Read more >
What is Amazon SageMaker? - TechTarget
During this step, data is transformed to enable feature engineering. Deploy and analyze. When the model is ready for deployment, the service automatically ...
Read more >
