Katib stuck on the first trial and the experiment keeps running without creating the next trial

/kind bug

What steps did you take and what happened:

I created an experiment using the YAML below. The experiment, trial, suggestion, and pods all start and run fine.

However, Katib created only one trial and then got stuck on it: it never started a second trial with a new parameter value, and no new suggestion appeared after the first trial.

Here are the YAML file and the status of each resource while the experiment was running.

  1. experiment yaml - using GPU, TFJob

  2. experiment status

  3. suggestions status

  4. model python script for tfjob container

  1. experiment yaml

apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example
spec:
  parallelTrialCount: 1
  maxTrialCount: 10
  maxFailedTrialCount: 10
  objective:
    type: minimize
    goal: 0
    objectiveMetricName: val_loss
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "20"
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: batchSize
        description: Batch Size
        reference: batch_size
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - command:
                      - python
                      - TFTemplate.py
                      - --file_name=model_for_katib2
                      - --work_dir=/kn/data
                      - --train=$(DATADIR)
                      - --model_name=kn-cnn
                      - --epochs=5
                      - --log_dir=/kn/data
                      - --batch_size=${trialParameters.batchSize}
                    env:
                      - name: DATADIR
                        valueFrom:
                          configMapKeyRef:
                            name: configmap
                            key: datadir
                    image: seungkyulee/kn_tf_gpu_no_template:2.0
                    name: tensorflow
                    volumeMounts:
                      - mountPath: /kn/data
                        name: volume
                    workingDir: /kn/data
                    resources:
                      limits:
                        nvidia.com/gpu: 1
                restartPolicy: Never
                volumes:
                  - name: volume
                    persistentVolumeClaim:
                      claimName: tfpvc
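To make the link between `parameters`, `trialParameters`, and the `${trialParameters.batchSize}` placeholder concrete, here is a minimal Python sketch. It is not Katib's implementation: `sample_assignment` and `render_command` are hypothetical helpers, and drawing uniformly with `random.randint` is only an assumption about what a random search over an integer range amounts to.

```python
import random

# Search space and trial parameters copied from the experiment above.
PARAMETERS = [{"name": "batch_size", "parameterType": "int",
               "feasibleSpace": {"min": "16", "max": "20"}}]
TRIAL_PARAMETERS = [{"name": "batchSize", "reference": "batch_size"}]
COMMAND = ["python", "TFTemplate.py", "--epochs=5",
           "--batch_size=${trialParameters.batchSize}"]


def sample_assignment(parameters, rng):
    """Hypothetical helper: draw one value per int parameter, bounds inclusive."""
    assignment = {}
    for p in parameters:
        if p["parameterType"] != "int":
            raise ValueError("this sketch only handles int parameters")
        lo = int(p["feasibleSpace"]["min"])
        hi = int(p["feasibleSpace"]["max"])
        assignment[p["name"]] = str(rng.randint(lo, hi))
    return assignment


def render_command(command, trial_parameters, assignment):
    """Hypothetical helper: replace ${trialParameters.<name>} placeholders."""
    rendered = []
    for arg in command:
        for tp in trial_parameters:
            placeholder = "${trialParameters.%s}" % tp["name"]
            arg = arg.replace(placeholder, assignment[tp["reference"]])
        rendered.append(arg)
    return rendered


if __name__ == "__main__":
    rng = random.Random(0)
    assignment = sample_assignment(PARAMETERS, rng)
    print(render_command(COMMAND, TRIAL_PARAMETERS, assignment))
```

Each new trial would get its own assignment, so a second trial should show up with a freshly drawn `batch_size` once the first one is reported as finished.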

  2. experiment status

ubuntu@ip-172-16-1-204:~/katib$ kubectl describe experiment tfjob-example -n kubeflow
Name:  tfjob-example
Namespace:  kubeflow
Labels:  <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"tfjob-example","namespace":"kubeflow"},"spec...
API Version:  kubeflow.org/v1beta1
Kind:  Experiment
Metadata:
  Creation Timestamp:  2021-01-12T06:14:29Z
  Finalizers:
    update-prometheus-metrics
  Generation:  1
  Resource Version:  1048867
  Self Link:  /apis/kubeflow.org/v1beta1/namespaces/kubeflow/experiments/tfjob-example
  UID:  e10cadae-7b6b-49a9-8a84-a4aba13d26df
Spec:
  Algorithm:
    Algorithm Name:  random
  Max Failed Trial Count:  10
  Max Trial Count:  10
  Metrics Collector Spec:
    Collector:
      Kind:  TensorFlowEvent
    Source:
      File System Path:
        Kind:  Directory
        Path:  /train
  Objective:
    Goal:  0
    Objective Metric Name:  val_loss
    Type:  minimize
  Parallel Trial Count:  1
  Parameters:
    Feasible Space:
      Max:  20
      Min:  16
    Name:  batch_size
    Parameter Type:  int
  Trial Template:
    Primary Container Name:  tensorflow
    Trial Parameters:
      Description:  Batch Size
      Name:  batchSize
      Reference:  batch_size
    Trial Spec:
      API Version:  kubeflow.org/v1
      Kind:  TFJob
      Spec:
        Tf Replica Specs:
          Worker:
            Replicas:  1
            Restart Policy:  OnFailure
            Template:
              Spec:
                Containers:
                  Command:
                    python
                    TFTemplate.py
                    --file_name=model_for_katib2
                    --work_dir=/kn/data
                    --train=$(DATADIR)
                    --model_name=kn-cnn
                    --epochs=5
                    --log_dir=/kn/data
                    --batch_size=${trialParameters.batchSize}
                  Env:
                    Name:  DATADIR
                    Value From:
                      Config Map Key Ref:
                        Key:  datadir
                        Name:  configmap
                  Image:  seungkyulee/kn_tf_gpu_no_template:2.0
                  Name:  tensorflow
                  Resources:
                    Limits:
                      nvidia.com/gpu:  1
                  Volume Mounts:
                    Mount Path:  /kn/data
                    Name:  volume
                  Working Dir:  /kn/data
                Restart Policy:  Never
                Volumes:
                  Name:  volume
                  Persistent Volume Claim:
                    Claim Name:  tfpvc
Status:
  Conditions:
    Last Transition Time:  2021-01-12T06:14:29Z
    Last Update Time:  2021-01-12T06:14:29Z
    Message:  Experiment is created
    Reason:  ExperimentCreated
    Status:  True
    Type:  Created
    Last Transition Time:  2021-01-12T06:14:50Z
    Last Update Time:  2021-01-12T06:14:50Z
    Message:  Experiment is running
    Reason:  ExperimentRunning
    Status:  True
    Type:  Running
  Current Optimal Trial:
    Best Trial Name:
    Observation:
      Metrics:  <nil>
    Parameter Assignments:  <nil>
  Running Trial List:
    tfjob-example-8v9gt8v8
  Start Time:  2021-01-12T06:14:29Z
  Trials:  1
  Trials Running:  1
Events:  <none>

  3. suggestions status

ubuntu@ip-172-16-1-204:~/katib$ kubectl describe suggestions tfjob-example -n kubeflow

Name:  tfjob-example
Namespace:  kubeflow
Labels:  <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"tfjob-example","namespace":"kubeflow"},"spec...
API Version:  kubeflow.org/v1beta1
Kind:  Suggestion
Metadata:
  Creation Timestamp:  2021-01-12T06:14:29Z
  Generation:  1
  Owner References:
    API Version:  kubeflow.org/v1beta1
    Block Owner Deletion:  true
    Controller:  true
    Kind:  Experiment
    Name:  tfjob-example
    UID:  e10cadae-7b6b-49a9-8a84-a4aba13d26df
  Resource Version:  1048857
  Self Link:  /apis/kubeflow.org/v1beta1/namespaces/kubeflow/suggestions/tfjob-example
  UID:  029e0905-7af3-43a5-8976-bbaafd83783a
Spec:
  Algorithm:
    Algorithm Name:  random
  Requests:  1
Status:
  Conditions:
    Last Transition Time:  2021-01-12T06:14:29Z
    Last Update Time:  2021-01-12T06:14:29Z
    Message:  Suggestion is created
    Reason:  SuggestionCreated
    Status:  True
    Type:  Created
    Last Transition Time:  2021-01-12T06:14:49Z
    Last Update Time:  2021-01-12T06:14:49Z
    Message:  Deployment is ready
    Reason:  DeploymentReady
    Status:  True
    Type:  DeploymentReady
    Last Transition Time:  2021-01-12T06:14:49Z
    Last Update Time:  2021-01-12T06:14:49Z
    Message:  Suggestion is running
    Reason:  SuggestionRunning
    Status:  True
    Type:  Running
  Start Time:  2021-01-12T06:14:29Z
  Suggestion Count:  1
  Suggestions:
    Name:  tfjob-example-8v9gt8v8
    Parameter Assignments:
      Name:  batch_size
      Value:  16
Events:  <none>

What did you expect to happen:

I expected Katib to create new trials with new parameter values (the batch size, in this example).

Anything else you would like to add:

Also, does anyone know how to find this directory and what it does?

source:
  fileSystemPath:
    path: /train
    kind: Directory
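As far as I understand the `TensorFlowEvent` collector, `path` is the directory inside the training container where Katib looks for TensorFlow event files (the `events.out.tfevents.*` files written by `tf.summary` or the Keras TensorBoard callback). The experiment above sets `path: /train` while the script is started with `--log_dir=/kn/data`, so it may be worth checking whether any event files actually land under the configured path. The sketch below is a hypothetical helper for that check, not part of Katib.

```python
import os
import tempfile


def find_event_files(root):
    """Return sorted paths under `root` whose file name contains 'tfevents'."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if "tfevents" in name:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Demo on a throwaway directory; inside the trial container you would
    # point this at the configured path (/train in the experiment above).
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "validation"))
        demo = os.path.join(root, "validation", "events.out.tfevents.0.host")
        open(demo, "w").close()
        print(find_event_files(root))
        # A directory that does not exist simply yields an empty list.
        print(find_event_files(os.path.join(root, "missing")))
```

An empty result for the configured path would mean the collector has nothing to read `val_loss` from, which is one possible reason for a trial never being reported as finished.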

status of first trial

ubuntu@ip-172-16-1-204:~/katib$ kubectl logs tfjob-example-8v9gt8v8-worker-0 -n kubeflow -c tensorflow

2021-01-12 06:14:54.274112: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-12 06:14:56.530644: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-12 06:14:56.531765: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-12 06:14:56.556211: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:56.557239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-01-12 06:14:56.557279: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-12 06:14:56.560870: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-12 06:14:56.560925: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-12 06:14:56.562425: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-12 06:14:56.562733: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-12 06:14:56.566658: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-12 06:14:56.567536: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-12 06:14:56.567745: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-12 06:14:56.567876: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:56.568916: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:56.569861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-01-12 06:14:56.569910: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-12 06:14:57.499897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-12 06:14:57.499958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-01-12 06:14:57.499971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-01-12 06:14:57.500311: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.501411: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.502432: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.503399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14760 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2021-01-12 06:14:57.532771: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-12 06:14:57.532935: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.533887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-01-12 06:14:57.533946: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-12 06:14:57.534012: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-12 06:14:57.534045: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-12 06:14:57.534073: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-12 06:14:57.534093: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-12 06:14:57.534112: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-12 06:14:57.534150: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-12 06:14:57.534188: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-12 06:14:57.534291: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.535313: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.536244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-01-12 06:14:57.536523: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-12 06:14:57.536640: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.537636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-01-12 06:14:57.537673: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-12 06:14:57.537696: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-12 06:14:57.537729: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-12 06:14:57.537749: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-12 06:14:57.537769: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-12 06:14:57.537794: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-12 06:14:57.537818: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-12 06:14:57.537838: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-12 06:14:57.537916: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.538900: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.539790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-01-12 06:14:57.539828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-12 06:14:57.539838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-01-12 06:14:57.539854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-01-12 06:14:57.539958: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.540987: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.541937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14760 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2021-01-12 06:14:57.924484: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-12 06:14:57.925171: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300070000 Hz
2021-01-12 06:14:58.481369: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-12 06:14:58.937601: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-12 06:14:58.940892: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
#Parameter list
--------------------------------
(Known) Namespace(activation='relu', batch_size=16, epochs=5, file_name='model_for_katib2', learning_rate=0, log_dir='/kn/data', model_dir='/kn/data/models', model_ver=0.3, train='/kn/data/dataset/region1', work_dir='/kn/data')
--------------------------------
(Unknown) ['--model_name=kn-cnn']

Found 70 files belonging to 4 classes.
Using 4 files for training.
Found 70 files belonging to 4 classes.
Using 66 files for validation.
Epoch 1/5
1/1 [==============================] - 4s 4s/step - loss: 1.3791 - accuracy: 0.2500 - val_loss: 0.8957 - val_accuracy: 0.8939
Epoch 2/5
1/1 [==============================] - 0s 40ms/step - loss: 0.8655 - accuracy: 1.0000 - val_loss: 0.3944 - val_accuracy: 0.8939
Epoch 3/5
1/1 [==============================] - 0s 40ms/step - loss: 0.3163 - accuracy: 1.0000 - val_loss: 0.2012 - val_accuracy: 0.9091
Epoch 4/5
1/1 [==============================] - 0s 66ms/step - loss: 0.0639 - accuracy: 1.0000 - val_loss: 0.1549 - val_accuracy: 0.9545
Epoch 5/5
1/1 [==============================] - 0s 40ms/step - loss: 0.0069 - accuracy: 1.0000 - val_loss: 0.1389 - val_accuracy: 0.9545
2021-01-12 06:15:01.874995: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-12 06:15:01.875337: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:15:01.876341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-01-12 06:15:01.876408: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-12 06:15:01.876458: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-12 06:15:01.876493: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-12 06:15:01.876525: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-12 06:15:01.876557: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-12 06:15:01.876590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-12 06:15:01.876624: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-12 06:15:01.876658: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-12 06:15:01.876757: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:15:01.877778: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:15:01.878652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-01-12 06:15:01.878699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-12 06:15:01.878712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-01-12 06:15:01.878720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-01-12 06:15:01.878848: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:15:01.879784: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:15:01.880651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 14760 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2021-01-12 06:15:02.834707: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 9325939197827833295
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 15477595200
locality {
  bus_id: 1
  links {
  }
}
incarnation: 4128588976737718326
physical_device_desc: "device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0"
]
<tensorflow.python.keras.callbacks.History object at 0x7fe078318668>
Found 70 files belonging to 4 classes.
[[18  0  0  0]
 [ 0 20  0  0]
 [ 0  0 16  0]
 [ 0  3  0 13]]
#Save model : /kn/data/models/0.3
#Saved cm : /kn/data/models/cm
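The `(Known)` and `(Unknown)` entries in the log look like the result of `argparse`'s `parse_known_args`, which returns unrecognised flags in a separate list instead of failing. The sketch below reproduces that behaviour. The argument definitions and defaults are assumptions for illustration, not the actual contents of `TFTemplate.py`.

```python
import argparse


def build_parser():
    # Assumed subset of the script's arguments; the real TFTemplate.py may differ.
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--log_dir", default="/tmp/logs")
    return parser


def parse(argv):
    """Split argv into a Namespace of known flags and a list of unknown ones."""
    known, unknown = build_parser().parse_known_args(argv)
    return known, unknown


if __name__ == "__main__":
    known, unknown = parse(["--batch_size=16", "--epochs=5",
                            "--log_dir=/kn/data", "--model_name=kn-cnn"])
    print("(Known)", known)
    print("(Unknown)", unknown)
```

In the log above, `--model_name=kn-cnn` ends up in the unknown list, so the script appears to ignore that flag rather than reject it.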

Environment:

  • Kubeflow version (kfctl version): 1.2
  • AWS ec2 p3xlarge
  • Kubernetes version: (use kubectl version): 1.15
  • OS (e.g. from /etc/os-release): ubuntu 18.04

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 14 (11 by maintainers)

Top GitHub Comments

andreyvelich commented, Feb 9, 2021 (1 reaction)

@rky0930 It’s great! Feel free to open a new issue if you have any other problems.

rky0930 commented, Feb 9, 2021 (1 reaction)

@andreyvelich Thanks for your help! Now it works!
