DGL Operator: Leverage DGL on K8s
This is Xiaoyu Zhai from Qihoo 360 AI Infra. Our AI/ML teams have recently had internal demand for the DGL/DGL-KE framework, so we have just kicked off research on distributed DGL training.
Native distributed DGL training works at the machine level: you need to manually set up the IP config, grant passwordless SSH access, use copy_files.py to dispatch your partition data, and use launch.py to invoke your training. What we want to offer our users is automated distributed training and, most importantly, workloads that can be orchestrated on K8s. So we decided to develop a “DGL Operator” to run DGL training on K8s. It covers the distributed scaffolding for ML engineers, so they only need to work on the partition script and the training script.
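To make the manual setup a bit more concrete, here is a minimal sketch of the first step an operator would take over: rendering the IP config from a list of worker addresses. It assumes the IP config is a plain-text file with one worker per line in the form `<ip> <port>`; the function name `render_ip_config`, the sample addresses, and the port value are hypothetical placeholders, not part of DGL or of the DGL Operator.

```python
from typing import Iterable


def render_ip_config(hosts: Iterable[str], port: int = 30050) -> str:
    """Render IP config text, one "<ip> <port>" line per worker.

    Assumption: the config is a plain-text file with one worker per line.
    The default port is an arbitrary placeholder for illustration.
    """
    lines = []
    seen = set()
    for host in hosts:
        host = host.strip()
        # Skip blank entries and duplicates so each worker appears once.
        if not host or host in seen:
            continue
        seen.add(host)
        lines.append(f"{host} {port}")
    # Terminate every line with a newline; an empty host list yields "".
    return "".join(f"{line}\n" for line in lines)


if __name__ == "__main__":
    # Hypothetical worker addresses, e.g. pod IPs collected by a controller.
    print(render_ip_config(["10.0.0.1", "10.0.0.2"]), end="")
```

On K8s the addresses would come from the pods the operator creates rather than from a hand-maintained list, which is the part users no longer have to manage themselves.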
The first version of DGL Operator will be finished by the end of this month, and we are glad to open source the project so that more developers can get involved in DGL or use DGL on K8s. However, I have a question, which is the main subject of this issue: is dmlc willing to host our project? I noticed that dmlc usually does not host Golang projects, but that's OK; we could also contribute this Operator to the Kubeflow community (the XGBoost Operator is hosted by Kubeflow).
Looking forward to any response from you; feel free to ping me.
After talking with @zheng-da and having an internal discussion in our team, we decided to contribute the DGL Operator repo to the Kubeflow community, because 1) DGL Operator is a Golang project and Kubernetes infrastructure, so contributing to Kubeflow may reach more Golang and Kubernetes engineers; 2) the Kubeflow community has many experienced Golang and Kubernetes engineers who can work together to improve its stability and high-level design.
We have already submitted the proposal to the Kubeflow community. Please let me know if there are any issues or concerns.
Proposal PR: https://github.com/kubeflow/community/pull/512
Reader-friendly version of the proposal: https://github.com/ryantd/community/blob/dgl-operator/proposals/dgl-operator-proposal.md
This issue is closed due to lack of activity. Feel free to reopen it if you still have questions.