[Feature Request] Direct SageMaker support?

See original GitHub issue

What is the problem this feature will solve?

A lot of individuals and companies use SageMaker for model training and deployment, but they are often not experts in wrapping repositories like this one for SageMaker. Instead, they tend to default to examples they can find that are already integrated with SageMaker. However, in the object detection space, these examples are often much less capable than MMDetection.

What is the feature you are proposing to solve the problem?

Creating a tools/train_sagemaker.py script and an example for training.

What alternatives have you considered?

Right now I have a train_sagemaker.py script that launches training by executing subprocess.Popen with a command that uses torchrun to launch tools/train.py. For example:

    import os
    import subprocess

    # `args`, `world`, and `config_file` are defined earlier in the script:
    # parsed CLI arguments, the cluster topology, and the MMDetection config path.

    # torchrun arguments describing the distributed topology
    launch_config = ["torchrun",
                     "--nnodes", str(world['number_of_machines']),
                     "--node_rank", str(world['machine_rank']),
                     "--nproc_per_node", str(world['number_of_processes']),
                     "--master_addr", world['master_addr'],
                     "--master_port", world['master_port']]

    # MMDetection training arguments; checkpoints are written to the SageMaker
    # checkpoint directory so they get synced to S3
    train_config = [os.path.join(os.environ["MMDETECTION"], "tools/train.py"),
                    config_file,
                    "--launcher", "pytorch",
                    "--work-dir", "/opt/ml/checkpoints"]

    if not args.validate:
        train_config.append("--no-validate")

    # Concatenate the torchrun launch config and the MMDetection train config
    joint_cmd = " ".join(str(x) for x in launch_config + train_config)
    print("The following command will be executed:\n", joint_cmd)

    process = subprocess.Popen(joint_cmd, stderr=subprocess.STDOUT,
                               stdout=subprocess.PIPE, shell=True)

    # Stream training output line by line until the process exits
    while True:
        output = process.stdout.readline()
        if process.poll() is not None:
            break
        if output:
            print(output.decode("utf-8").strip())

    rc = process.poll()
    if rc != 0:
        raise subprocess.CalledProcessError(returncode=rc, cmd=joint_cmd)
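
The `world` dict above isn't defined in the snippet; inside a SageMaker training container it would typically be derived from the environment variables SageMaker injects. Here is a minimal sketch of how it could be built (the variable names match the script above, but the exact construction is an assumption, not part of the original issue):

    import json
    import os

    # Hypothetical reconstruction of the `world` dict used above, built from
    # the environment variables SageMaker sets inside a training container.
    hosts = json.loads(os.environ["SM_HOSTS"])    # e.g. ["algo-1", "algo-2"]
    current_host = os.environ["SM_CURRENT_HOST"]  # e.g. "algo-1"

    world = {
        "number_of_machines": len(hosts),
        "machine_rank": hosts.index(current_host),
        "number_of_processes": int(os.environ["SM_NUM_GPUS"]),  # GPUs per node
        "master_addr": hosts[0],  # convention: first host acts as master
        "master_port": "7777",    # any free port; kept as a string above
    }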

But maybe there’s a better way to accomplish this and integrate it more directly?
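
For context, a script like this is typically launched through the SageMaker Python SDK's PyTorch estimator. Below is a minimal sketch of what that could look like; the role ARN, S3 path, instance settings, and source layout are all placeholder assumptions:

    from sagemaker.pytorch import PyTorch

    # Hedged sketch: run the proposed train_sagemaker.py as a SageMaker
    # training job. Every value below is a placeholder.
    estimator = PyTorch(
        entry_point="train_sagemaker.py",  # the script discussed above
        source_dir="tools",                # assumed location of the script
        role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
        instance_count=2,                  # two nodes -> multi-machine torchrun
        instance_type="ml.p3.8xlarge",     # example multi-GPU instance type
        framework_version="1.12",          # a PyTorch version that ships torchrun
        py_version="py38",
        hyperparameters={"validate": 1},   # forwarded to the script's argparse
    )

    # Each channel appears under /opt/ml/input/data/<channel> in the container.
    estimator.fit({"training": "s3://my-bucket/coco/"})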

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
austinmw commented, Oct 23, 2022

No problem, thanks for working on this, let me know if you need any help!

0 reactions
ZwwWayne commented, Oct 23, 2022

Hi @austinmw, thanks for sharing. The code is definitely helpful to us. We will review it closely and may have a design in the following month. It might take several weeks, as we do not have AWS services for now and already have plans for this month and the next.

Read more comments on GitHub >

Top Results From Across the Web

Create, Store, and Share Features with Amazon SageMaker ...
The offline store can help you store and serve features for exploration and model training. The online store retains only the latest feature...
Read more >
Ingesting Historical Feature Data into SageMaker Feature Store
In this blog post I show how to write historical feature data directly into S3, which is the backbone of the SMFS offline...
Read more >
Amazon SageMaker Feature Store Deep Dive Demo - YouTube
In this demo video, you'll learn how Amazon SageMaker Feature Store helps to store, update, retrieve, and share machine learning (ML) ...
Read more >
aws/amazon-sagemaker-examples - GitHub
GitHub - aws/amazon-sagemaker-examples: Example Jupyter notebooks that ... to directly deploy the best model to an endpoint to serve inference requests.
Read more >
What is Amazon SageMaker? - TechTarget
During this step, data is transformed to enable feature engineering. Deploy and analyze. When the model is ready for deployment, the service automatically ...
Read more >
