Katib is stuck on the first trial and the experiment keeps running without creating the next trial
See original GitHub issue
/kind bug
What steps did you take and what happened:
I created an experiment using the manifest below. The experiment, trial, suggestion, and pods all start and run fine.
However, Katib created only one trial and got stuck on it; it never started a second trial with a new parameter from the suggestion. No new suggestion was produced after the first trial.
Here are the YAML file and the status of each resource while it was running:
- experiment yaml (using GPU, TFJob)
- experiment status
- suggestions status
- model python script for the TFJob container
- experiment yaml
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example
spec:
  parallelTrialCount: 1
  maxTrialCount: 10
  maxFailedTrialCount: 10
  objective:
    type: minimize
    goal: 0
    objectiveMetricName: val_loss
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "20"
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: batchSize
        description: Batch Size
        reference: batch_size
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - command:
                      - python
                      - TFTemplate.py
                      - --file_name=model_for_katib2
                      - --work_dir=/kn/data
                      - --train=$(DATADIR)
                      - --model_name=kn-cnn
                      - --epochs=5
                      - --log_dir=/kn/data
                      - --batch_size=${trialParameters.batchSize}
                    env:
                      - name: DATADIR
                        valueFrom:
                          configMapKeyRef:
                            name: configmap
                            key: datadir
                    image: seungkyulee/kn_tf_gpu_no_template:2.0
                    name: tensorflow
                    volumeMounts:
                      - mountPath: /kn/data
                        name: volume
                    workingDir: /kn/data
                    resources:
                      limits:
                        nvidia.com/gpu: 1
                restartPolicy: Never
                volumes:
                  - name: volume
                    persistentVolumeClaim:
                      claimName: tfpvc
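One mismatch in this spec is worth flagging (an observation, not a confirmed root cause from this thread): the TensorFlowEvent collector watches path: /train, while the training command writes its logs with --log_dir=/kn/data. If no TensorFlow event files ever appear under /train, the metrics collector can never report val_loss, the trial never completes, and with parallelTrialCount: 1 no further trial is created. A sketch of a collector spec pointed at the script's actual log directory, assuming TFTemplate.py writes event files there:

metricsCollectorSpec:
  source:
    fileSystemPath:
      # match --log_dir so the metrics-collector sidecar can find event files
      path: /kn/data
      kind: Directory
  collector:
    kind: TensorFlowEvent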
- experiment status
ubuntu@ip-172-16-1-204:~/katib$ kubectl describe experiment tfjob-example -n kubeflow
Name:         tfjob-example
Namespace:    kubeflow
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"tfjob-example","namespace":"kubeflow"},"spec…
API Version:  kubeflow.org/v1beta1
Kind:         Experiment
Metadata:
  Creation Timestamp:  2021-01-12T06:14:29Z
  Finalizers:
    update-prometheus-metrics
  Generation:        1
  Resource Version:  1048867
  Self Link:         /apis/kubeflow.org/v1beta1/namespaces/kubeflow/experiments/tfjob-example
  UID:               e10cadae-7b6b-49a9-8a84-a4aba13d26df
Spec:
  Algorithm:
    Algorithm Name:        random
  Max Failed Trial Count:  10
  Max Trial Count:         10
  Metrics Collector Spec:
    Collector:
      Kind:  TensorFlowEvent
    Source:
      File System Path:
        Kind:  Directory
        Path:  /train
  Objective:
    Goal:                   0
    Objective Metric Name:  val_loss
    Type:                   minimize
  Parallel Trial Count:     1
  Parameters:
    Feasible Space:
      Max:           20
      Min:           16
    Name:            batch_size
    Parameter Type:  int
  Trial Template:
    Primary Container Name:  tensorflow
    Trial Parameters:
      Description:  Batch Size
      Name:         batchSize
      Reference:    batch_size
    Trial Spec:
      API Version:  kubeflow.org/v1
      Kind:         TFJob
      Spec:
        Tf Replica Specs:
          Worker:
            Replicas:        1
            Restart Policy:  OnFailure
            Template:
              Spec:
                Containers:
                  Command:
                    python
                    TFTemplate.py
                    --file_name=model_for_katib2
                    --work_dir=/kn/data
                    --train=$(DATADIR)
                    --model_name=kn-cnn
                    --epochs=5
                    --log_dir=/kn/data
                    --batch_size=${trialParameters.batchSize}
                  Env:
                    Name:  DATADIR
                    Value From:
                      Config Map Key Ref:
                        Key:   datadir
                        Name:  configmap
                  Image:  seungkyulee/kn_tf_gpu_no_template:2.0
                  Name:   tensorflow
                  Resources:
                    Limits:
                      nvidia.com/gpu:  1
                  Volume Mounts:
                    Mount Path:  /kn/data
                    Name:        volume
                  Working Dir:   /kn/data
                Restart Policy:  Never
                Volumes:
                  Name:  volume
                  Persistent Volume Claim:
                    Claim Name:  tfpvc
Status:
  Conditions:
    Last Transition Time:  2021-01-12T06:14:29Z
    Last Update Time:      2021-01-12T06:14:29Z
    Message:               Experiment is created
    Reason:                ExperimentCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-01-12T06:14:50Z
    Last Update Time:      2021-01-12T06:14:50Z
    Message:               Experiment is running
    Reason:                ExperimentRunning
    Status:                True
    Type:                  Running
  Current Optimal Trial:
    Best Trial Name:
    Observation:
      Metrics:                <nil>
      Parameter Assignments:  <nil>
  Running Trial List:
    tfjob-example-8v9gt8v8
  Start Time:      2021-01-12T06:14:29Z
  Trials:          1
  Trials Running:  1
Events:            <none>
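Trials: 1 and Trials Running: 1 together with parallelTrialCount: 1 mean Katib is still waiting for the first trial to report val_loss and complete before it requests a new suggestion. A quick way to check the trial's own conditions (standard kubectl; the trial name is taken from Running Trial List above):

ubuntu@ip-172-16-1-204:~/katib$ kubectl get trial tfjob-example-8v9gt8v8 -n kubeflow
ubuntu@ip-172-16-1-204:~/katib$ kubectl describe trial tfjob-example-8v9gt8v8 -n kubeflow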
- suggestions status
ubuntu@ip-172-16-1-204:~/katib$ kubectl describe suggestions tfjob-example -n kubeflow
Name:         tfjob-example
Namespace:    kubeflow
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"tfjob-example","namespace":"kubeflow"},"spec…
API Version:  kubeflow.org/v1beta1
Kind:         Suggestion
Metadata:
  Creation Timestamp:  2021-01-12T06:14:29Z
  Generation:          1
  Owner References:
    API Version:           kubeflow.org/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Experiment
    Name:                  tfjob-example
    UID:                   e10cadae-7b6b-49a9-8a84-a4aba13d26df
  Resource Version:  1048857
  Self Link:         /apis/kubeflow.org/v1beta1/namespaces/kubeflow/suggestions/tfjob-example
  UID:               029e0905-7af3-43a5-8976-bbaafd83783a
Spec:
  Algorithm:
    Algorithm Name:  random
  Requests:          1
Status:
  Conditions:
    Last Transition Time:  2021-01-12T06:14:29Z
    Last Update Time:      2021-01-12T06:14:29Z
    Message:               Suggestion is created
    Reason:                SuggestionCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-01-12T06:14:49Z
    Last Update Time:      2021-01-12T06:14:49Z
    Message:               Deployment is ready
    Reason:                DeploymentReady
    Status:                True
    Type:                  DeploymentReady
    Last Transition Time:  2021-01-12T06:14:49Z
    Last Update Time:      2021-01-12T06:14:49Z
    Message:               Suggestion is running
    Reason:                SuggestionRunning
    Status:                True
    Type:                  Running
  Start Time:        2021-01-12T06:14:29Z
  Suggestion Count:  1
  Suggestions:
    Name:  tfjob-example-8v9gt8v8
    Parameter Assignments:
      Name:   batch_size
      Value:  16
Events:  <none>
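Requests: 1 and Suggestion Count: 1 show the controller asked for exactly one suggestion and never came back for more, which points at the trial never reaching a Succeeded state rather than at a broken suggestion service. To double-check the suggestion side anyway, its deployment logs can be inspected (the deployment name here is an assumption based on Katib's usual <experiment>-<algorithm> naming):

ubuntu@ip-172-16-1-204:~/katib$ kubectl logs deployment/tfjob-example-random -n kubeflow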
What did you expect to happen:
I expected Katib to create a new trial with a new parameter (in this example, the batch size).
Anything else you would like to add:
Also, does anyone know how to find this directory and what it does?
source:
  fileSystemPath:
    path: /train
    kind: Directory
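For context: with the TensorFlowEvent collector, fileSystemPath.path is a directory inside the trial's training container that the injected metrics-collector sidecar scans for TensorFlow event files, so the training code itself has to write its summaries there. A minimal, hypothetical sketch of this (the real TFTemplate.py is not included in the report), assuming TF 2.x as in the logs below:

import numpy as np
import tensorflow as tf

# Toy stand-in for the real model and data; only the logging location matters.
x = np.random.rand(64, 4).astype("float32")
y = np.random.randint(0, 2, size=(64,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# /train must match metricsCollectorSpec.source.fileSystemPath.path so the
# sidecar can parse val_loss out of the event files written by this callback.
callback = tf.keras.callbacks.TensorBoard(log_dir="/train")
model.fit(x, y, validation_split=0.2, epochs=5, callbacks=[callback])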
- status of first trial
ubuntu@ip-172-16-1-204:~/katib$ kubectl logs tfjob-example-8v9gt8v8-worker-0 -n kubeflow -c tensorflow
2021-01-12 06:14:54.274112: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-12 06:14:56.530644: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-12 06:14:56.531765: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-12 06:14:56.556211: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:56.557239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-01-12 06:14:56.557279: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-12 06:14:56.560870: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-12 06:14:56.560925: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-12 06:14:56.562425: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-12 06:14:56.562733: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-12 06:14:56.566658: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-12 06:14:56.567536: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-12 06:14:56.567745: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-12 06:14:56.567876: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:56.568916: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:56.569861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-01-12 06:14:56.569910: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-12 06:14:57.499897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-12 06:14:57.499958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-01-12 06:14:57.499971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-01-12 06:14:57.500311: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.501411: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.502432: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.503399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14760 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2021-01-12 06:14:57.532771: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-12 06:14:57.532935: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.533887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-01-12 06:14:57.533946: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-12 06:14:57.534012: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-12 06:14:57.534045: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-12 06:14:57.534073: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-12 06:14:57.534093: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-12 06:14:57.534112: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-12 06:14:57.534150: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-12 06:14:57.534188: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-12 06:14:57.534291: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.535313: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.536244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-01-12 06:14:57.536523: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-12 06:14:57.536640: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.537636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-01-12 06:14:57.537673: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-12 06:14:57.537696: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-12 06:14:57.537729: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-12 06:14:57.537749: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-12 06:14:57.537769: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-12 06:14:57.537794: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-12 06:14:57.537818: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-12 06:14:57.537838: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-12 06:14:57.537916: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.538900: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.539790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-01-12 06:14:57.539828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-12 06:14:57.539838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-01-12 06:14:57.539854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-01-12 06:14:57.539958: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.540987: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:14:57.541937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14760 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2021-01-12 06:14:57.924484: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-12 06:14:57.925171: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300070000 Hz
2021-01-12 06:14:58.481369: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-12 06:14:58.937601: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-12 06:14:58.940892: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
#Parameter list
--------------------------------
(Known) Namespace(activation='relu', batch_size=16, epochs=5, file_name='model_for_katib2', learning_rate=0, log_dir='/kn/data', model_dir='/kn/data/models', model_ver=0.3, train='/kn/data/dataset/region1', work_dir='/kn/data')
--------------------------------
(Unknown) ['--model_name=kn-cnn']
Found 70 files belonging to 4 classes.
Using 4 files for training.
Found 70 files belonging to 4 classes.
Using 66 files for validation.
Epoch 1/5
1/1 [==============================] - 4s 4s/step - loss: 1.3791 - accuracy: 0.2500 - val_loss: 0.8957 - val_accuracy: 0.8939
Epoch 2/5
1/1 [==============================] - 0s 40ms/step - loss: 0.8655 - accuracy: 1.0000 - val_loss: 0.3944 - val_accuracy: 0.8939
Epoch 3/5
1/1 [==============================] - 0s 40ms/step - loss: 0.3163 - accuracy: 1.0000 - val_loss: 0.2012 - val_accuracy: 0.9091
Epoch 4/5
1/1 [==============================] - 0s 66ms/step - loss: 0.0639 - accuracy: 1.0000 - val_loss: 0.1549 - val_accuracy: 0.9545
Epoch 5/5
1/1 [==============================] - 0s 40ms/step - loss: 0.0069 - accuracy: 1.0000 - val_loss: 0.1389 - val_accuracy: 0.9545
2021-01-12 06:15:01.874995: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-12 06:15:01.875337: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:15:01.876341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-01-12 06:15:01.876408: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-12 06:15:01.876458: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-12 06:15:01.876493: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-12 06:15:01.876525: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-12 06:15:01.876557: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-12 06:15:01.876590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-12 06:15:01.876624: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-12 06:15:01.876658: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-12 06:15:01.876757: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:15:01.877778: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:15:01.878652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-01-12 06:15:01.878699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-12 06:15:01.878712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-01-12 06:15:01.878720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-01-12 06:15:01.878848: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:15:01.879784: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-12 06:15:01.880651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 14760 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2021-01-12 06:15:02.834707: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
[name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 9325939197827833295
, name: "/device:GPU:0" device_type: "GPU" memory_limit: 15477595200 locality { bus_id: 1 links { } } incarnation: 4128588976737718326 physical_device_desc: "device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0" ]
<tensorflow.python.keras.callbacks.History object at 0x7fe078318668>
Found 70 files belonging to 4 classes.
[[18  0  0  0]
 [ 0 20  0  0]
 [ 0  0 16  0]
 [ 0  3  0 13]]
#Save model : /kn/data/models/0.3
#Saved cm : /kn/data/models/cm
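The training container clearly ran to completion (val_loss was printed for every epoch), so the next thing to check, not shown above, is whether the metrics-collector sidecar ever picked those values up. A sketch, assuming the default sidecar container name Katib v1beta1 injects and the standard controller deployment name in Kubeflow 1.2:

ubuntu@ip-172-16-1-204:~/katib$ kubectl logs tfjob-example-8v9gt8v8-worker-0 -n kubeflow -c metrics-logger-and-collector
ubuntu@ip-172-16-1-204:~/katib$ kubectl logs deployment/katib-controller -n kubeflow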
Environment:
- Kubeflow version (kfctl version): 1.2, on an AWS EC2 p3xlarge instance
- Kubernetes version (kubectl version): 1.15
- OS (e.g. from /etc/os-release): Ubuntu 18.04
Top GitHub Comments
@rky0930 That's great! Feel free to open a new issue if you have any other problems.
@andreyvelich Thanks for your help! Now it works!