Having trouble running NNI with FrameworkController on k8s


Describe the issue: When I tried running NNI with FrameworkController on k8s, I used the following YAML files.

  • I used NFS for storage.

The NNI config, config_framework.yml:

authorName: default
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: frameworkcontroller
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: .
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 3
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: 192.168.1.106
    # Your NFS server export path, like /var/nfs/nni
    path: /home/mj_lee/mount
  serviceAccountName: frameworkcontroller

The FrameworkController StatefulSet, frameworkcontroller-with-default-config.yaml:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frameworkcontroller
  namespace: default
spec:
  serviceName: frameworkcontroller
  selector:
    matchLabels:
      app: frameworkcontroller
  replicas: 1
  template:
    metadata:
      labels:
        app: frameworkcontroller
    spec:
      # Using the ServiceAccount with granted permission
      # if the k8s cluster enforces authorization.
      serviceAccountName: frameworkcontroller
      containers:
      - name: frameworkcontroller
        image: frameworkcontroller/frameworkcontroller
        # Using k8s inClusterConfig, so usually, no need to specify
        # KUBE_APISERVER_ADDRESS or KUBECONFIG
        env:
        #- name: KUBE_APISERVER_ADDRESS
        #  value: {http[s]://host:port}
          - name: KUBECONFIG
            value: ~/.kube/config

I then applied the StatefulSet with the command below:

kubectl apply -f frameworkcontroller-with-default-config.yaml

The frameworkcontroller-0 pod then reached the Running state.

[screenshot]

Next, I ran the nnictl command:

nnictl create --config config_framework.yml

A new experiment worker pod was created, but it failed to run.

[screenshot]

I checked the logs with kubectl logs nniexp~:

[screenshot]

So I checked the NFS mount directory. There is no nni directory in it, but there is an envs directory and a run.sh file.

[screenshot]

I think it should have created nni/experiment_id/run.sh in the mount folder.
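The failing container runs sh /tmp/mount/nni/r2ys5f9a/run.sh, and /tmp/mount is where the nni-vol NFS volume is mounted, so the script would be expected at nni/r2ys5f9a/run.sh under the exported directory. A minimal Python sketch of that check, assuming it is run on a machine where the export is visible as a local directory; expected_on_export and check_run_script are hypothetical helper names, not part of NNI:

```python
from pathlib import Path, PurePosixPath


def expected_on_export(container_script: str, mount_point: str, export_root: str) -> Path:
    """Map a script path inside the container to its location under the NFS export.

    Raises ValueError if container_script is not located under mount_point.
    """
    relative = PurePosixPath(container_script).relative_to(mount_point)
    return Path(export_root).joinpath(*relative.parts)


def check_run_script(container_script: str, mount_point: str, export_root: str) -> bool:
    """Report whether the script the pod tries to run exists under the export."""
    target = expected_on_export(container_script, mount_point, export_root)
    if target.is_file():
        print(f"found: {target}")
        return True
    print(f"missing: {target}")
    return False


if __name__ == "__main__":
    # Values copied from the pod description in this issue.
    check_run_script(
        "/tmp/mount/nni/r2ys5f9a/run.sh",
        "/tmp/mount",
        "/home/zerooneai/mj_lee/mount",
    )
```

If this prints "missing" while run.sh sits directly in the export root, the pod and NNI may disagree about where the trial files are written.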

Here is the kubectl describe output for the nniexp-worker-0 pod:

Name:         nniexpr2ys5f9aenvzchoa-worker-0
Namespace:    default
Priority:     0
Node:         zerooneai-p210908-4/192.168.1.104
Start Time:   Fri, 25 Feb 2022 14:33:07 +0900
Labels:       FC_FRAMEWORK_NAME=nniexpr2ys5f9aenvzchoa
              FC_TASKROLE_NAME=worker
              FC_TASK_INDEX=0
Annotations:  FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
              FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
              FC_FRAMEWORK_ATTEMPT_ID: 0
              FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
              FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
              FC_FRAMEWORK_NAMESPACE: default
              FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
              FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
              FC_TASKROLE_NAME: worker
              FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
              FC_TASK_ATTEMPT_ID: 0
              FC_TASK_INDEX: 0
              FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
              cni.projectcalico.org/podIP: 10.0.243.33/32
              cni.projectcalico.org/podIPs: 10.0.243.33/32
Status:       Running
IP:           10.0.243.33
IPs:
  IP:           10.0.243.33
Controlled By:  ConfigMap/nniexpr2ys5f9aenvzchoa-attempt
Init Containers:
  frameworkbarrier:
    Container ID:   docker://b05885b647cdb41dba4587f6f93eeb5bd19a390641687012bc017d73cc21aa79
    Image:          frameworkcontroller/frameworkbarrier
    Image ID:       docker-pullable://frameworkcontroller/frameworkbarrier@sha256:9d95e31152460e3cc5c7ad2b09738c1fdb540ff7a50abc72b2f8f9d0badb87da
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 25 Feb 2022 14:33:12 +0900
      Finished:     Fri, 25 Feb 2022 14:33:22 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      FC_FRAMEWORK_NAMESPACE:             default
      FC_FRAMEWORK_NAME:                  nniexpr2ys5f9aenvzchoa
      FC_TASKROLE_NAME:                   worker
      FC_TASK_INDEX:                      0
      FC_CONFIGMAP_NAME:                  nniexpr2ys5f9aenvzchoa-attempt
      FC_POD_NAME:                        nniexpr2ys5f9aenvzchoa-worker-0
      FC_FRAMEWORK_UID:                   2c55ec33-69b8-43a4-a643-84ff4e0604b2
      FC_FRAMEWORK_ATTEMPT_ID:            0
      FC_FRAMEWORK_ATTEMPT_INSTANCE_UID:  0_0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_CONFIGMAP_UID:                   0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_TASKROLE_UID:                    751bf95b-c6bd-4dd0-aafe-e160f9c10220
      FC_TASK_UID:                        27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
      FC_TASK_ATTEMPT_ID:                 0
      FC_POD_UID:                          (v1:metadata.uid)
      FC_TASK_ATTEMPT_INSTANCE_UID:       0_$(FC_POD_UID)
    Mounts:
      /mnt/frameworkbarrier from frameworkbarrier-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Containers:
  framework:
    Container ID:  docker://dc9ff6579c67e8bc394c734e8add70fbdb581d014541044d1877a9e5d888f828
    Image:         msranni/nni:latest
    Image ID:      docker-pullable://msranni/nni@sha256:8985fb134204ef523e113ac4a572ae7460cd246a5ff471df413f7d17dd917cd1
    Port:          4000/TCP
    Host Port:     0/TCP
    Command:
      sh
      /tmp/mount/nni/r2ys5f9a/run.sh
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   sh: 0: Can't open /tmp/mount/nni/r2ys5f9a/run.sh

      Exit Code:    127
      Started:      Fri, 25 Feb 2022 14:36:43 +0900
      Finished:     Fri, 25 Feb 2022 14:36:43 +0900
    Ready:          False
    Restart Count:  5
    Limits:
      cpu:     1
      memory:  8Gi
    Requests:
      cpu:     1
      memory:  8Gi
    Environment:
      FC_FRAMEWORK_NAMESPACE:             default
      FC_FRAMEWORK_NAME:                  nniexpr2ys5f9aenvzchoa
      FC_TASKROLE_NAME:                   worker
      FC_TASK_INDEX:                      0
      FC_CONFIGMAP_NAME:                  nniexpr2ys5f9aenvzchoa-attempt
      FC_POD_NAME:                        nniexpr2ys5f9aenvzchoa-worker-0
      FC_FRAMEWORK_UID:                   2c55ec33-69b8-43a4-a643-84ff4e0604b2
      FC_FRAMEWORK_ATTEMPT_ID:            0
      FC_FRAMEWORK_ATTEMPT_INSTANCE_UID:  0_0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_CONFIGMAP_UID:                   0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_TASKROLE_UID:                    751bf95b-c6bd-4dd0-aafe-e160f9c10220
      FC_TASK_UID:                        27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
      FC_TASK_ATTEMPT_ID:                 0
      FC_POD_UID:                          (v1:metadata.uid)
      FC_TASK_ATTEMPT_INSTANCE_UID:       0_$(FC_POD_UID)
    Mounts:
      /mnt/frameworkbarrier from frameworkbarrier-volume (rw)
      /tmp/mount from nni-vol (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  nni-vol:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    192.168.1.106
    Path:      /home/zerooneai/mj_lee/mount
    ReadOnly:  false
  frameworkbarrier-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  frameworkcontroller-token-7sw6q:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  frameworkcontroller-token-7sw6q
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  6m19s                 default-scheduler  Successfully assigned default/nniexpr2ys5f9aenvzchoa-worker-0 to zerooneai-p210908-4
  Normal   Pulling    6m18s                 kubelet            Pulling image "frameworkcontroller/frameworkbarrier"
  Normal   Pulled     6m15s                 kubelet            Successfully pulled image "frameworkcontroller/frameworkbarrier" in 3.364620261s
  Normal   Created    6m14s                 kubelet            Created container frameworkbarrier
  Normal   Started    6m14s                 kubelet            Started container frameworkbarrier
  Normal   Pulled     6m1s                  kubelet            Successfully pulled image "msranni/nni:latest" in 2.375328373s
  Normal   Pulled     5m56s                 kubelet            Successfully pulled image "msranni/nni:latest" in 4.709013579s
  Normal   Pulled     5m36s                 kubelet            Successfully pulled image "msranni/nni:latest" in 2.373976028s
  Normal   Pulling    5m9s (x4 over 6m4s)   kubelet            Pulling image "msranni/nni:latest"
  Normal   Created    5m7s (x4 over 6m1s)   kubelet            Created container framework
  Normal   Pulled     5m7s                  kubelet            Successfully pulled image "msranni/nni:latest" in 2.484752039s
  Normal   Started    5m6s (x4 over 6m1s)   kubelet            Started container framework
  Warning  BackOff    71s (x22 over 5m54s)  kubelet            Back-off restarting failed container
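The pod description lists the NFS volume as 192.168.1.106:/home/zerooneai/mj_lee/mount, while config_framework.yml sets path: /home/mj_lee/mount. Whether that difference matters here is not established in the issue, but comparing the two values is a cheap check. A minimal Python sketch, assuming the describe output has been saved as text; nfs_volume_from_describe and matches_config are hypothetical helpers:

```python
import re
from typing import Optional, Tuple


def nfs_volume_from_describe(describe_text: str) -> Optional[Tuple[str, str]]:
    """Return (server, path) of the first NFS volume in `kubectl describe pod` text."""
    match = re.search(
        r"Type:\s+NFS.*?\n\s*Server:\s+(\S+)\s*\n\s*Path:\s+(\S+)",
        describe_text,
    )
    return (match.group(1), match.group(2)) if match else None


def matches_config(describe_text: str, server: str, path: str) -> bool:
    """Compare the mounted NFS volume with the server/path from the NNI config."""
    return nfs_volume_from_describe(describe_text) == (server, path)


# Excerpt of the Volumes section from the pod description in this issue.
EXCERPT = """\
Volumes:
  nni-vol:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    192.168.1.106
    Path:      /home/zerooneai/mj_lee/mount
    ReadOnly:  false
"""

print(nfs_volume_from_describe(EXCERPT))
# Server and path as written in config_framework.yml.
print(matches_config(EXCERPT, "192.168.1.106", "/home/mj_lee/mount"))
```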

Please let me know how to solve this problem. Thanks!

Environment:

  • NNI version: 2.6
  • Training service (local|remote|pai|aml|etc): frameworkcontroller
  • Client OS: ubuntu 18.04
  • Server OS (for remote mode only):
  • Python version: 3.6.9
  • PyTorch/TensorFlow version: 1.10.1+cu102

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

2 reactions
amznero commented, Aug 3, 2022

@vincenthp2603

but the trials then take more time than when there’s only 1 worker.

Does “time” mean training duration? If so, this scenario didn’t happen to me, and I don’t think the concurrency will affect the training duration.

You can freeze the random seeds (NumPy, torch, CUDA, cuDNN, etc.) and set worker=1 to record an experiment baseline (batch size, epochs, model parameters, training duration). Then train the model in concurrent mode and compare it with the baseline. Maybe the training duration is related to model complexity or to the training strategy (like a Genetic Algorithm)?
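Freezing the seeds as suggested above could look roughly like the following Python sketch. The function name freeze_seeds is hypothetical, and NumPy and PyTorch are seeded only if they are installed; this is one common pattern, not something prescribed by NNI:

```python
import os
import random


def freeze_seeds(seed: int = 0) -> None:
    """Seed every random number generator available in this environment."""
    random.seed(seed)
    # Setting PYTHONHASHSEED at runtime only affects subprocesses started later.
    os.environ["PYTHONHASHSEED"] = str(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


if __name__ == "__main__":
    freeze_seeds(42)
    first = random.random()
    freeze_seeds(42)
    print(first == random.random())
```

Calling freeze_seeds once at the start of the trial script, before the model and data loaders are created, keeps the baseline run and the concurrent runs comparable.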


You can see my changes here: https://github.com/microsoft/nni/pull/5045.

0 reactions
liuzhe-lz commented, Sep 8, 2022

NNI v2.9 has been released.
