Trouble running NNI with FrameworkController on k8s
Describe the issue: When I tried NNI with FrameworkController on k8s, I used the YAML files below.
- I used NFS for storage.
NNI config:
config_framework.yml
authorName: default
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: frameworkcontroller
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: .
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 3
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: 192.168.1.106
    # Your NFS server export path, like /var/nfs/nni
    path: /home/mj_lee/mount
  serviceAccountName: frameworkcontroller
FrameworkController StatefulSet:
frameworkcontroller-with-default-config.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frameworkcontroller
  namespace: default
spec:
  serviceName: frameworkcontroller
  selector:
    matchLabels:
      app: frameworkcontroller
  replicas: 1
  template:
    metadata:
      labels:
        app: frameworkcontroller
    spec:
      # Using the ServiceAccount with granted permission
      # if the k8s cluster enforces authorization.
      serviceAccountName: frameworkcontroller
      containers:
      - name: frameworkcontroller
        image: frameworkcontroller/frameworkcontroller
        # Using k8s inClusterConfig, so usually, no need to specify
        # KUBE_APISERVER_ADDRESS or KUBECONFIG
        env:
        #- name: KUBE_APISERVER_ADDRESS
        #  value: {http[s]://host:port}
        - name: KUBECONFIG
          value: ~/.kube/config
I applied the StatefulSet with the command below:
kubectl apply -f frameworkcontroller-with-default-config.yaml
The frameworkcontroller-0 pod then reached the Running state.
Next I created the experiment with nnictl:
nnictl create --config config_framework.yml
A new experiment worker pod was created, but it failed to run.
I checked its logs with kubectl logs nniexp~ and then looked at the NFS mount directory.
There is no nni directory there, but the mount does contain an envs directory and a run.sh file.
I think NNI should have created nni/experiment_id/run.sh in the mount folder.
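The check described above can be sketched as a small script to run on the NFS server (or on any machine that mounts the export). It is only an illustration: the function name `inspect_mount` is hypothetical, and the `nni/<experiment_id>/run.sh` layout is an assumption taken from the path the worker pod tries to execute, not from NNI documentation.

```python
from pathlib import Path


def inspect_mount(mount_root: str, experiment_id: str) -> dict:
    """Report whether the run.sh the worker pod expects exists under the mount.

    Assumed layout (taken from the pod's command line, not from NNI docs):
        <mount_root>/nni/<experiment_id>/run.sh
    """
    root = Path(mount_root)
    expected = root / "nni" / experiment_id / "run.sh"
    # List every run.sh that actually exists, relative to the mount root,
    # so a script written to an unexpected location is easy to spot.
    found = sorted(p.relative_to(root).as_posix() for p in root.rglob("run.sh"))
    return {
        "expected": str(expected),
        "expected_exists": expected.is_file(),
        "found": found,
    }
```

Calling `inspect_mount("/home/mj_lee/mount", "r2ys5f9a")` on the NFS server would, for the situation reported here, presumably show `expected_exists` as false and a `run.sh` at the top level of the mount. The experiment ID `r2ys5f9a` is read off the pod's command path.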
Here is the kubectl describe output for the nniexp-worker-0 pod:
Name: nniexpr2ys5f9aenvzchoa-worker-0
Namespace: default
Priority: 0
Node: zerooneai-p210908-4/192.168.1.104
Start Time: Fri, 25 Feb 2022 14:33:07 +0900
Labels: FC_FRAMEWORK_NAME=nniexpr2ys5f9aenvzchoa
FC_TASKROLE_NAME=worker
FC_TASK_INDEX=0
Annotations: FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
FC_FRAMEWORK_ATTEMPT_ID: 0
FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
FC_FRAMEWORK_NAMESPACE: default
FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
FC_TASKROLE_NAME: worker
FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
FC_TASK_ATTEMPT_ID: 0
FC_TASK_INDEX: 0
FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
cni.projectcalico.org/podIP: 10.0.243.33/32
cni.projectcalico.org/podIPs: 10.0.243.33/32
Status: Running
IP: 10.0.243.33
IPs:
IP: 10.0.243.33
Controlled By: ConfigMap/nniexpr2ys5f9aenvzchoa-attempt
Init Containers:
frameworkbarrier:
Container ID: docker://b05885b647cdb41dba4587f6f93eeb5bd19a390641687012bc017d73cc21aa79
Image: frameworkcontroller/frameworkbarrier
Image ID: docker-pullable://frameworkcontroller/frameworkbarrier@sha256:9d95e31152460e3cc5c7ad2b09738c1fdb540ff7a50abc72b2f8f9d0badb87da
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 25 Feb 2022 14:33:12 +0900
Finished: Fri, 25 Feb 2022 14:33:22 +0900
Ready: True
Restart Count: 0
Environment:
FC_FRAMEWORK_NAMESPACE: default
FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
FC_TASKROLE_NAME: worker
FC_TASK_INDEX: 0
FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
FC_FRAMEWORK_ATTEMPT_ID: 0
FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
FC_TASK_ATTEMPT_ID: 0
FC_POD_UID: (v1:metadata.uid)
FC_TASK_ATTEMPT_INSTANCE_UID: 0_$(FC_POD_UID)
Mounts:
/mnt/frameworkbarrier from frameworkbarrier-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Containers:
framework:
Container ID: docker://dc9ff6579c67e8bc394c734e8add70fbdb581d014541044d1877a9e5d888f828
Image: msranni/nni:latest
Image ID: docker-pullable://msranni/nni@sha256:8985fb134204ef523e113ac4a572ae7460cd246a5ff471df413f7d17dd917cd1
Port: 4000/TCP
Host Port: 0/TCP
Command:
sh
/tmp/mount/nni/r2ys5f9a/run.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Message: sh: 0: Can't open /tmp/mount/nni/r2ys5f9a/run.sh
Exit Code: 127
Started: Fri, 25 Feb 2022 14:36:43 +0900
Finished: Fri, 25 Feb 2022 14:36:43 +0900
Ready: False
Restart Count: 5
Limits:
cpu: 1
memory: 8Gi
Requests:
cpu: 1
memory: 8Gi
Environment:
FC_FRAMEWORK_NAMESPACE: default
FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
FC_TASKROLE_NAME: worker
FC_TASK_INDEX: 0
FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
FC_FRAMEWORK_ATTEMPT_ID: 0
FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
FC_TASK_ATTEMPT_ID: 0
FC_POD_UID: (v1:metadata.uid)
FC_TASK_ATTEMPT_INSTANCE_UID: 0_$(FC_POD_UID)
Mounts:
/mnt/frameworkbarrier from frameworkbarrier-volume (rw)
/tmp/mount from nni-vol (rw)
/var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nni-vol:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: 192.168.1.106
Path: /home/zerooneai/mj_lee/mount
ReadOnly: false
frameworkbarrier-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
frameworkcontroller-token-7sw6q:
Type: Secret (a volume populated by a Secret)
SecretName: frameworkcontroller-token-7sw6q
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m19s default-scheduler Successfully assigned default/nniexpr2ys5f9aenvzchoa-worker-0 to zerooneai-p210908-4
Normal Pulling 6m18s kubelet Pulling image "frameworkcontroller/frameworkbarrier"
Normal Pulled 6m15s kubelet Successfully pulled image "frameworkcontroller/frameworkbarrier" in 3.364620261s
Normal Created 6m14s kubelet Created container frameworkbarrier
Normal Started 6m14s kubelet Started container frameworkbarrier
Normal Pulled 6m1s kubelet Successfully pulled image "msranni/nni:latest" in 2.375328373s
Normal Pulled 5m56s kubelet Successfully pulled image "msranni/nni:latest" in 4.709013579s
Normal Pulled 5m36s kubelet Successfully pulled image "msranni/nni:latest" in 2.373976028s
Normal Pulling 5m9s (x4 over 6m4s) kubelet Pulling image "msranni/nni:latest"
Normal Created 5m7s (x4 over 6m1s) kubelet Created container framework
Normal Pulled 5m7s kubelet Successfully pulled image "msranni/nni:latest" in 2.484752039s
Normal Started 5m6s (x4 over 6m1s) kubelet Started container framework
Warning BackOff 71s (x22 over 5m54s) kubelet Back-off restarting failed container
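To find where the missing script should live on the NFS server, the container path from the describe output can be translated back to the export path. The sketch below uses only values shown above: the command path, the `/tmp/mount` mount point, and the `nni-vol` export path. The helper name `export_path_for` is hypothetical. Note that the describe output lists the export as `/home/zerooneai/mj_lee/mount`, while the config listing earlier shows `/home/mj_lee/mount`. Whether that difference matters is not established in this thread, but it is worth confirming which directory was actually inspected.

```python
from pathlib import PurePosixPath


def export_path_for(container_path: str, mount_point: str, export_root: str) -> str:
    """Map a path inside the container to the matching path on the NFS export.

    Raises ValueError if container_path is not located under mount_point.
    """
    relative = PurePosixPath(container_path).relative_to(PurePosixPath(mount_point))
    return str(PurePosixPath(export_root) / relative)


# Values copied from the `kubectl describe` output above.
server_side = export_path_for(
    container_path="/tmp/mount/nni/r2ys5f9a/run.sh",
    mount_point="/tmp/mount",
    export_root="/home/zerooneai/mj_lee/mount",
)
print(server_side)  # /home/zerooneai/mj_lee/mount/nni/r2ys5f9a/run.sh
```

If that file does not exist on 192.168.1.106, the `Can't open` error is the expected result.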
Please let me know how to solve this problem. Thanks!
Environment:
- NNI version: 2.6
- Training service (local|remote|pai|aml|etc): frameworkcontroller
- Client OS: ubuntu 18.04
- Server OS (for remote mode only):
- Python version: 3.6.9
- PyTorch/TensorFlow version: 1.10.1+cu102
Issue Analytics
- State:
- Created 2 years ago
- Comments:7 (4 by maintainers)
Top GitHub Comments
@vincenthp2603
Does “time” mean training duration? If so, I have not seen this happen, and I don’t think concurrency affects the training duration.
You can freeze the random seeds (NumPy, torch, CUDA, cuDNN, etc.) and set worker=1 to record an experiment baseline (batch size, epochs, model parameters, training duration). Then train the model in concurrent mode and compare it with the baseline. Maybe the training duration is related to model complexity or to the training strategy (for example, a genetic algorithm)?
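A minimal sketch of the seed-freezing step, assuming a PyTorch trial like the mnist.py example. The helper name `freeze_seeds` is hypothetical. NumPy and torch are seeded only if they are installed, and the cuDNN flags trade some speed for reproducibility.

```python
import random


def freeze_seeds(seed: int = 0) -> None:
    """Seed every random number generator that is importable here."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        # Silently ignored when CUDA is not available.
        torch.cuda.manual_seed_all(seed)
        # Ask cuDNN for deterministic kernels and disable autotuning.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # torch not installed; nothing to seed
```

Calling `freeze_seeds(42)` at the top of the trial script, before the model and data loaders are built, gives the baseline run and the concurrent runs the same starting point.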
You can see my changes here: https://github.com/microsoft/nni/pull/5045.
NNI v2.9 has been released.