GPU not consuming for Katib experiment - GKE Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
What steps did you take and what happened: I am trying to create a Kubeflow pipeline that tunes the hyperparameters of a text classification model in TensorFlow using Katib on a GKE cluster. I created the cluster using the commands below:
CLUSTER_NAME="kubeflow-pipelines-standalone-v2"
ZONE="us-central1-a"
MACHINE_TYPE="n1-standard-2"
SCOPES="cloud-platform"
NODES_NUM=1
gcloud container clusters create $CLUSTER_NAME --zone $ZONE --machine-type $MACHINE_TYPE --scopes $SCOPES --num-nodes $NODES_NUM
gcloud config set compute/zone $ZONE
gcloud container clusters get-credentials $CLUSTER_NAME
export PIPELINE_VERSION=1.8.2
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=$PIPELINE_VERSION"
# katib
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.13.0"
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.4.0"
kubectl apply -f ./test.yaml
# disabling caching
export NAMESPACE=kubeflow
kubectl get mutatingwebhookconfiguration cache-webhook-${NAMESPACE}
kubectl patch mutatingwebhookconfiguration cache-webhook-${NAMESPACE} --type='json' -p='[{"op":"replace", "path": "/webhooks/0/rules/0/operations/0", "value": "DELETE"}]'
kubectl describe configmap inverse-proxy-config -n kubeflow | grep googleusercontent.com
GPU_POOL_NAME="gpu-pool2"
CLUSTER_NAME="kubeflow-pipelines-standalone-v2"
CLUSTER_ZONE="us-central1-a"
GPU_TYPE="nvidia-tesla-k80"
GPU_COUNT=1
MACHINE_TYPE="n1-highmem-8"
NODES_NUM=1
# Node pool creation may take several minutes.
gcloud container node-pools create ${GPU_POOL_NAME} --accelerator type=${GPU_TYPE},count=${GPU_COUNT} --zone ${CLUSTER_ZONE} --cluster ${CLUSTER_NAME} --machine-type=${MACHINE_TYPE} --scopes=cloud-platform --num-nodes ${NODES_NUM}
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
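Once the driver installer DaemonSet has finished, it can help to confirm that the GPU node pool actually advertises the `nvidia.com/gpu` resource before scheduling any GPU workload. The following is a minimal sketch, assuming the official `kubernetes` Python client is installed and a kubeconfig for the cluster is available; the helper names `gpu_nodes` and `list_allocatable` and the sample node names are hypothetical.

```python
from typing import Dict, Mapping

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes(allocatable_by_node: Mapping[str, Mapping[str, str]]) -> Dict[str, int]:
    """Return {node_name: gpu_count} for nodes that advertise at least one GPU."""
    result = {}
    for name, allocatable in allocatable_by_node.items():
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            result[name] = count
    return result


def list_allocatable() -> Dict[str, Dict[str, str]]:
    """Read each node's allocatable resources from the cluster (needs cluster access)."""
    # Imported lazily so that gpu_nodes() can be used without the client installed.
    from kubernetes import client, config
    config.load_kube_config()
    nodes = client.CoreV1Api().list_node().items
    return {n.metadata.name: dict(n.status.allocatable or {}) for n in nodes}


# Offline example with made-up node names and resources.
sample = {
    "default-pool-node": {"cpu": "1930m", "memory": "5Gi"},
    "gpu-pool-node": {"cpu": "7910m", "memory": "48Gi", "nvidia.com/gpu": "1"},
}
print(gpu_nodes(sample))
```

Against a live cluster, `gpu_nodes(list_allocatable())` returning an empty dict would suggest that the driver or device plugin has not registered any GPU yet.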
I then created a kubeflow pipeline:
from kfp import compiler
import kfp
import kfp.dsl as dsl
from kfp import components
@dsl.pipeline(
    name="End to End Pipeline",
    description="An end to end mnist example including hyperparameter tuning, train and inference"
)
def pipeline_func(
    time_loc = "gs://faris_bucket_us_central/Pipeline_data/input_dataset/dbpedia_model/GKE_Katib/time_csv/",
    hyper_image_uri_train = "gcr.io/.............../hptunekatib:v7",
    hyper_image_uri = "gcr.io/.............../hptunekatibclient:v7",
    model_uri = "gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/",
    experiment_name = "dbpedia-exp-1",
    experiment_namespace = "kubeflow",
    experiment_timeout_minutes = 60
):
    # hyperparameter tuning stage: launches the Katib experiment; the best parameters are read from /output.txt
    hp_tune = dsl.ContainerOp(
        name='hp-tune-katib',
        image=hyper_image_uri,
        arguments=[
            '--experiment_name', experiment_name,
            '--experiment_namespace', experiment_namespace,
            '--experiment_timeout_minutes', experiment_timeout_minutes,
            '--delete_after_done', True,
            '--hyper_image_uri', hyper_image_uri_train,
            '--time_loc', time_loc,
            '--model_uri', model_uri
        ],
        file_outputs={'best-params': '/output.txt'}
    ).set_gpu_limit(1)
    # restricting the maximum usable memory and cpu for this stage
    hp_tune.set_memory_limit("49G")
    hp_tune.set_cpu_limit("7")
# Run the Kubeflow Pipeline in the user's namespace.
if __name__ == '__main__':
    # compiling the pipeline and generating tar.gz file to upload to Kubeflow Pipeline UI
    import kfp.compiler as compiler
    compiler.Compiler().compile(
        pipeline_func, 'pipeline_db.tar.gz'
    )
These are my two containers:
- To launch the katib experiments based on the specified parameters and arguments passed to the dsl.ContainerOp()
- The main training script for text classification. This container is passed as “image” to the trial spec for katib
gcr.io/…/hptunekatibclient:v7
# importing required packages
import argparse
import datetime
from datetime import datetime as dt
from distutils.util import strtobool
import json
import os
import logging
import time
import pandas as pd
from google.cloud import storage
from pytz import timezone
from kubernetes.client import V1ObjectMeta
from kubeflow.katib import KatibClient
from kubeflow.katib import ApiClient
from kubeflow.katib import V1beta1Experiment
from kubeflow.katib import V1beta1ExperimentSpec
from kubeflow.katib import V1beta1AlgorithmSpec
from kubeflow.katib import V1beta1ObjectiveSpec
from kubeflow.katib import V1beta1ParameterSpec
from kubeflow.katib import V1beta1FeasibleSpace
from kubeflow.katib import V1beta1TrialTemplate
from kubeflow.katib import V1beta1TrialParameterSpec
from kubeflow.katib import V1beta1MetricsCollectorSpec
from kubeflow.katib import V1beta1CollectorSpec
from kubeflow.katib import V1beta1FileSystemPath
from kubeflow.katib import V1beta1SourceSpec
from kubeflow.katib import V1beta1FilterSpec
logger = logging.getLogger()
logging.basicConfig(level=logging.INFO)
FINISH_CONDITIONS = ["Succeeded", "Failed"]
# function to record the start time and end time to calculate execution time, pipeline start up and teardown time
def write_time(types, time_loc):
    formats = "%Y-%m-%d %I:%M:%S %p"
    now_utc = dt.now(timezone('UTC'))
    now_asia = now_utc.astimezone(timezone('Asia/Kolkata'))
    start_time = str(now_asia.strftime(formats))
    time_df = pd.DataFrame({"time":[start_time]})
    print("written")
    time_df.to_csv(time_loc + types + ".csv", index=False)
def get_args():
    parser = argparse.ArgumentParser(description='Katib Experiment launcher')
    parser.add_argument('--experiment_name', type=str,
                        help='Experiment name')
    parser.add_argument('--experiment_namespace', type=str, default='anonymous',
                        help='Experiment namespace')
    parser.add_argument('--experiment_timeout_minutes', type=int, default=60*24,
                        help='Time in minutes to wait for the Experiment to complete')
    parser.add_argument('--delete_after_done', type=strtobool, default=True,
                        help='Whether to delete the Experiment after it is finished')
    parser.add_argument('--hyper_image_uri', type=str, default="gcr.io/.............../hptunekatib:v2",
                        help='Hyper image uri')
    parser.add_argument('--time_loc', type=str, default="gs://faris_bucket_us_central/Pipeline_data/input_dataset/dbpedia_model/GKE_Katib/time_csv/",
                        help='Time loc')
    parser.add_argument('--model_uri', type=str, default="gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/",
                        help='Model URI')
    return parser.parse_args()
def wait_experiment_finish(katib_client, experiment, timeout):
    polling_interval = datetime.timedelta(seconds=30)
    end_time = datetime.datetime.now() + datetime.timedelta(minutes=timeout)
    experiment_name = experiment.metadata.name
    experiment_namespace = experiment.metadata.namespace
    while True:
        current_status = None
        try:
            current_status = katib_client.get_experiment_status(name=experiment_name, namespace=experiment_namespace)
        except Exception as e:
            logger.info("Unable to get current status for the Experiment: {} in namespace: {}. Exception: {}".format(
                experiment_name, experiment_namespace, e))
        # If Experiment has reached complete condition, exit the loop.
        if current_status in FINISH_CONDITIONS:
            logger.info("Experiment: {} in namespace: {} has reached the end condition: {}".format(
                experiment_name, experiment_namespace, current_status))
            return
        # Print the current condition.
        logger.info("Current condition for Experiment: {} in namespace: {} is: {}".format(
            experiment_name, experiment_namespace, current_status))
        # If timeout has been reached, raise an exception.
        if datetime.datetime.now() > end_time:
            raise Exception("Timeout waiting for Experiment: {} in namespace: {} "
                            "to reach one of these conditions: {}".format(
                                experiment_name, experiment_namespace, FINISH_CONDITIONS))
        # Sleep for poll interval.
        time.sleep(polling_interval.seconds)
if __name__ == "__main__":
    args = get_args()
    write_time("hyper_parameter_tuning_start", args.time_loc)
    # Trial count specification.
    max_trial_count = 2
    max_failed_trial_count = 2
    parallel_trial_count = 1
    # Objective specification.
    objective = V1beta1ObjectiveSpec(
        type="maximize",  # accuracy should be maximized, not minimized
        # goal=100,
        objective_metric_name="accuracy"
        # additional_metric_names=["accuracy"]
    )
    # Metrics collector specification (currently unused).
    # metrics_collector_specs = V1beta1MetricsCollectorSpec(
    #     collector=V1beta1CollectorSpec(kind="File"),
    #     source=V1beta1SourceSpec(
    #         file_system_path=V1beta1FileSystemPath(
    #             # format="TEXT",
    #             path="/opt/trainer/katib/metrics.log",
    #             kind="File"
    #         ),
    #         filter=V1beta1FilterSpec(
    #             # metrics_format=["{metricName: ([\\w|-]+), metricValue: ((-?\\d+)(\\.\\d+)?)}"]
    #             metrics_format=["([\w|-]+)\s*=\s*([+-]?\d*(\.\d+)?([Ee][+-]?\d+)?)"]
    #         )
    #     )
    # )
    # Algorithm specification.
    algorithm = V1beta1AlgorithmSpec(
        algorithm_name="random",
    )
    # Experiment search space.
    # In this example we tune learning rate and batch size.
    parameters = [
        V1beta1ParameterSpec(
            name="batch_size",
            parameter_type="discrete",
            feasible_space=V1beta1FeasibleSpace(
                list=["32", "42", "52", "62", "64"]
            ),
        ),
        V1beta1ParameterSpec(
            name="learning_rate",
            parameter_type="double",
            feasible_space=V1beta1FeasibleSpace(
                min="0.001",
                max="0.005"
            ),
        )
    ]
    # TODO (andreyvelich): Use community image for the mnist example.
    trial_spec = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "spec": {
            "tfReplicaSpecs": {
                "PS": {
                    "replicas": 1,
                    "restartPolicy": "Never",
                    "template": {
                        "metadata": {
                            "annotations": {
                                "sidecar.istio.io/inject": "false",
                            }
                        },
                        "spec": {
                            "containers": [
                                {
                                    "name": "tensorflow",
                                    "image": args.hyper_image_uri,
                                    "command": [
                                        "python",
                                        "/opt/trainer/task.py",
                                        "--model_uri=" + args.model_uri,
                                        "--batch_size=${trialParameters.batchSize}",
                                        "--learning_rate=${trialParameters.learningRate}"
                                    ],
                                    "ports" : [
                                        {
                                            "containerPort": 2222,
                                            "name" : "tfjob-port"
                                        }
                                    ]
                                    # "resources": {
                                    #     "limits" : {
                                    #         "cpu": "1"
                                    #     }
                                    # }
                                }
                            ]
                        }
                    }
                },
                "Worker": {
                    "replicas": 1,
                    "restartPolicy": "Never",
                    "template": {
                        "metadata": {
                            "annotations": {
                                "sidecar.istio.io/inject": "false",
                            }
                        },
                        "spec": {
                            "containers": [
                                {
                                    "name": "tensorflow",
                                    "image": args.hyper_image_uri,
                                    "command": [
                                        "python",
                                        "/opt/trainer/task.py",
                                        "--model_uri=" + args.model_uri,
                                        "--batch_size=${trialParameters.batchSize}",
                                        "--learning_rate=${trialParameters.learningRate}"
                                    ],
                                    "ports" : [
                                        {
                                            "containerPort": 2222,
                                            "name" : "tfjob-port"
                                        }
                                    ]
                                    # "resources": {
                                    #     "limits" : {
                                    #         "nvidia.com/gpu": 1
                                    #     }
                                    # }
                                }
                            ]
                        }
                    }
                }
            }
        }
    }
    # Configure parameters for the Trial template.
    trial_template = V1beta1TrialTemplate(
        primary_container_name="tensorflow",
        trial_parameters=[
            V1beta1TrialParameterSpec(
                name="batchSize",
                description="batch size",
                reference="batch_size"
            ),
            V1beta1TrialParameterSpec(
                name="learningRate",
                description="Learning rate",
                reference="learning_rate"
            ),
        ],
        trial_spec=trial_spec
    )
    # Create an Experiment from the above parameters.
    experiment_spec = V1beta1ExperimentSpec(
        max_trial_count=max_trial_count,
        max_failed_trial_count=max_failed_trial_count,
        parallel_trial_count=parallel_trial_count,
        objective=objective,
        algorithm=algorithm,
        parameters=parameters,
        trial_template=trial_template
    )
    experiment_name = args.experiment_name
    experiment_namespace = args.experiment_namespace
    logger.info("Creating Experiment: {} in namespace: {}".format(experiment_name, experiment_namespace))
    # Create Experiment object.
    experiment = V1beta1Experiment(
        api_version="kubeflow.org/v1beta1",
        kind="Experiment",
        metadata=V1ObjectMeta(
            name=experiment_name,
            namespace=experiment_namespace
        ),
        spec=experiment_spec
    )
    logger.info("Experiment Spec : " + str(experiment_spec))
    logger.info("Experiment: " + str(experiment))
    # Create Katib client.
    katib_client = KatibClient()
    # Create Experiment in Kubernetes cluster.
    output = katib_client.create_experiment(experiment, namespace=experiment_namespace)
    # Wait until Experiment is created.
    end_time = datetime.datetime.now() + datetime.timedelta(minutes=60)
    while True:
        current_status = None
        # Try to get Experiment status.
        try:
            current_status = katib_client.get_experiment_status(name=experiment_name, namespace=experiment_namespace)
        except Exception:
            logger.info("Waiting until Experiment is created...")
        # If current status is set, exit the loop.
        if current_status is not None:
            break
        # If timeout has been reached, raise an exception.
        if datetime.datetime.now() > end_time:
            raise Exception("Timeout waiting for Experiment: {} in namespace: {} to be created".format(
                experiment_name, experiment_namespace))
        time.sleep(1)
    logger.info("Experiment is created")
    # Wait for Experiment finish.
    wait_experiment_finish(katib_client, experiment, args.experiment_timeout_minutes)
    # Check if Experiment is successful.
    if katib_client.is_experiment_succeeded(name=experiment_name, namespace=experiment_namespace):
        logger.info("Experiment: {} in namespace: {} is successful".format(
            experiment_name, experiment_namespace))
        optimal_hp = katib_client.get_optimal_hyperparameters(
            name=experiment_name, namespace=experiment_namespace)
        logger.info("Optimal hyperparameters:\n{}".format(optimal_hp))
        # # Create dir if it doesn't exist.
        # if not os.path.exists(os.path.dirname("output.txt")):
        #     os.makedirs(os.path.dirname("output.txt"))
        # Save HyperParameters to the file.
        with open("output.txt", 'w') as f:
            f.write(json.dumps(optimal_hp))
    else:
        logger.info("Experiment: {} in namespace: {} has failed".format(
            experiment_name, experiment_namespace))
        # Print Experiment if it has failed.
        experiment = katib_client.get_experiment(name=experiment_name, namespace=experiment_namespace)
        logger.info(experiment)
    # Delete Experiment if it is needed.
    if args.delete_after_done:
        katib_client.delete_experiment(name=experiment_name, namespace=experiment_namespace)
        logger.info("Experiment: {} in namespace: {} has been deleted".format(
            experiment_name, experiment_namespace))
    write_time("hyper_parameter_tuning_end", args.time_loc)
Dockerfile
FROM gcr.io/deeplearning-platform-release/tf-gpu.2-8
# installing packages
RUN pip install pandas
RUN pip install gcsfs
RUN pip install google-cloud-storage
RUN pip install pytz
RUN pip install kubernetes
RUN pip install kubeflow-katib
# moving code to preprocess
RUN mkdir /hp_tune
COPY task.py /hp_tune
# CREDENTIAL Authentication
COPY /prj-vertex-ai-2c390f7e8fec.json /hp_tune/prj-vertex-ai-2c390f7e8fec.json
ENV GOOGLE_APPLICATION_CREDENTIALS="/hp_tune/prj-vertex-ai-2c390f7e8fec.json"
# entry point
ENTRYPOINT ["python3", "/hp_tune/task.py"]
gcr.io/…/hptunekatib:v7
# import os
# os.system("pip install tensorflow-gpu==2.8.0")
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import os
from tensorflow.keras.layers import Conv1D, MaxPool1D ,Embedding ,concatenate
from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense,Input
from tensorflow.keras.models import Model
from tensorflow import keras
from datetime import datetime
from pytz import timezone
from sklearn.model_selection import train_test_split
import pandas as pd
from google.cloud import storage
import argparse
import logging
logger = logging.getLogger()
logging.basicConfig(level=logging.INFO)
logger.info("Num GPUs Available: " + str(tf.config.list_physical_devices('GPU')))
import subprocess
process = subprocess.Popen(['sh', '-c', 'nvidia-smi'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = process.communicate()
logger.info("NVIDIA SMI " + str(out))
def format_strs(x):
    strs = ""
    if x > 0:
        sign_t = "+"
        strs += "+"
    else:
        sign_t = "-"
        strs += "-"
    strs = strs + "{:.1e}".format(x)
    if "+" in strs[1:]:
        sign = "+"
        strs = strs[1:].split("+")
    else:
        sign = "-"
        strs = strs[1:].split("-")
    last_d = strs[1][1:] if strs[1][0] == "0" else strs[1]
    strs_f = sign_t + strs[0] + sign + last_d
    return strs_f
def get_args():
    '''Parses args. Must include all hyperparameters you want to tune.'''
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--learning_rate',
        required=True,
        type=float,
        help='learning_rate')
    parser.add_argument(
        '--batch_size',
        required=True,
        type=int,
        help='batch_size')
    parser.add_argument(
        '--model_uri',
        required=True,
        type=str,
        help='Model Uri')
    args = parser.parse_args()
    return args
def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The ID of your GCS object
    # source_blob_name = "storage-object-name"
    # The path to which the file should be downloaded
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
def create_dataset():
    download_blob("faris_bucket_us_central", "Pipeline_data/input_dataset/dbpedia_model/data/" + "train.csv", "train.csv")
    trainData = pd.read_csv('train.csv')
    trainData.columns = ['label','title','description']
    # trainData = trainData.sample(frac=0.002)
    X_train, X_test, y_train, y_test = train_test_split(trainData['description'], trainData['label'], stratify=trainData['label'], test_size=0.1, random_state=0)
    return X_train, X_test, y_train, y_test
def train_model(train_X, train_y, test_X, test_y, learning_rate, batch_size):
    logger.info("Training with lr = " + str(learning_rate) + ", bs = " + str(batch_size))
    bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
    bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/2", trainable=False)
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessed_text = bert_preprocess(text_input)
    outputs = bert_encoder(preprocessed_text)
    # Neural network layers
    l = tf.keras.layers.Dropout(0.2, name="dropout")(outputs['pooled_output']) # dropout_rate
    l = tf.keras.layers.Dense(14,activation='softmax',kernel_initializer=tf.keras.initializers.GlorotNormal(seed=24))(l) # dense_units
    model = Model(inputs=[text_input], outputs=l)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),loss='categorical_crossentropy',metrics=['accuracy'])
    history = model.fit(train_X, train_y, epochs=5, validation_data=(test_X, test_y), batch_size=batch_size)
    return model, history
def main():
    args = get_args()
    logger.info("Creating dataset")
    train_X, test_X, train_y, test_y = create_dataset()
    # one_hot_encoding the class label
    encoder = LabelEncoder()
    encoder.fit(train_y)
    y_train_encoded = encoder.transform(train_y)
    y_test_encoded = encoder.transform(test_y)
    y_train_ohe = tf.keras.utils.to_categorical(y_train_encoded)
    y_test_ohe = tf.keras.utils.to_categorical(y_test_encoded)
    logger.info("Training model")
    model = train_model(
        train_X,
        y_train_ohe,
        test_X,
        y_test_ohe,
        args.learning_rate,
        int(float(args.batch_size))
    )
    logger.info("Saving model")
    artifact_filename = 'saved_model'
    local_path = artifact_filename
    tf.saved_model.save(model[0], local_path)
    # Upload model artifact to Cloud Storage
    model_directory = args.model_uri + "-".join(os.environ["HOSTNAME"].split("-")[:-2]) + "/"
    local_path = "saved_model/assets/vocab.txt"
    storage_path = os.path.join(model_directory, "assets/vocab.txt")
    blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
    blob.upload_from_filename(local_path)
    local_path = "saved_model/variables/variables.data-00000-of-00001"
    storage_path = os.path.join(model_directory, "variables/variables.data-00000-of-00001")
    blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
    blob.upload_from_filename(local_path)
    local_path = "saved_model/variables/variables.index"
    storage_path = os.path.join(model_directory, "variables/variables.index")
    blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
    blob.upload_from_filename(local_path)
    local_path = "saved_model/saved_model.pb"
    storage_path = os.path.join(model_directory, "saved_model.pb")
    blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
    blob.upload_from_filename(local_path)
    logger.info("Model Saved at " + model_directory)
    logger.info("Keras Score: " + str(model[1].history["accuracy"][-1]))
    hp_metric = model[1].history["accuracy"][-1]
    print("accuracy =", format_strs(hp_metric))
if __name__ == "__main__":
    main()
Dockerfile
# FROM gcr.io/deeplearning-platform-release/tf-cpu.2-8
FROM gcr.io/deeplearning-platform-release/tf-gpu.2-8
RUN mkdir -p /opt/trainer
# RUN pip install scikit-learn
RUN pip install tensorflow_text==2.8.1
# RUN pip install tensorflow-gpu==2.8.0
# CREDENTIAL Authentication
COPY /prj-vertex-ai-2c390f7e8fec.json /prj-vertex-ai-2c390f7e8fec.json
ENV GOOGLE_APPLICATION_CREDENTIALS="/prj-vertex-ai-2c390f7e8fec.json"
COPY *.py /opt/trainer/
# # RUN chgrp -R 0 /opt/trainer && chmod -R g+rwX /opt/trainer
# RUN chmod -R 777 /home/trainer
ENTRYPOINT ["python", "/opt/trainer/task.py"]
# Sets up the entry point to invoke the trainer.
# ENTRYPOINT ["python", "-m", "trainer.task"]
The pipeline runs, but it does not use the GPU, and this piece of code
logger.info("Num GPUs Available: " + str(tf.config.list_physical_devices('GPU')))
import subprocess
process = subprocess.Popen(['sh', '-c', 'nvidia-smi'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = process.communicate()
logger.info("NVIDIA SMI " + str(out))
returns an empty list and an empty string, as if the GPU did not exist. I am attaching the logs of the container:
insertId | labels."compute.googleapis.com/resource_name" | labels."k8s-pod/group-name" | labels."k8s-pod/job-name" | labels."k8s-pod/replica-index" | labels."k8s-pod/replica-type" | labels."k8s-pod/training_kubeflow_org/job-name" | labels."k8s-pod/training_kubeflow_org/operator-name" | labels."k8s-pod/training_kubeflow_org/replica-index" | labels."k8s-pod/training_kubeflow_org/replica-type" | logName | receiveTimestamp | resource.labels.cluster_name | resource.labels.container_name | resource.labels.location | resource.labels.namespace_name | resource.labels.pod_name | resource.labels.project_id | resource.type | severity | textPayload | timestamp
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
saaah727bfds9ymw | gke-kubeflow-pipelines-s-default-pool-e4e6dda3-544k | kubeflow.org | dbpedia-exp-1-ntq7tfvj | 0 | ps | dbpedia-exp-1-ntq7tfvj | tfjob-controller | 0 | ps | projects/prj-vertex-ai/logs/stdout | 2022-07-11T06:07:35.222632672Z | kubeflow-pipelines-standalone-v2 | tensorflow | us-central1-a | kubeflow | dbpedia-exp-1-ntq7tfvj-ps-0 | prj-vertex-ai | k8s_container | INFO | accuracy = +9.9e-1 | 2022-07-11T06:07:30.812554270Z
cg5hf72zfi4a8ymi | gke-kubeflow-pipelines-s-default-pool-e4e6dda3-544k | kubeflow.org | dbpedia-exp-1-ntq7tfvj | 0 | ps | dbpedia-exp-1-ntq7tfvj | tfjob-controller | 0 | ps | projects/prj-vertex-ai/logs/stderr | 2022-07-11T06:07:35.218143792Z | kubeflow-pipelines-standalone-v2 | tensorflow | us-central1-a | kubeflow | dbpedia-exp-1-ntq7tfvj-ps-0 | prj-vertex-ai | k8s_container | ERROR | INFO:root:Num GPUs Available: [] | 2022-07-11T06:07:30.812527036Z
0n32rintpe0v865p | gke-kubeflow-pipelines-s-default-pool-e4e6dda3-544k | kubeflow.org | dbpedia-exp-1-ntq7tfvj | 0 | ps | dbpedia-exp-1-ntq7tfvj | tfjob-controller | 0 | ps | projects/prj-vertex-ai/logs/stderr | 2022-07-11T06:07:35.218143792Z | kubeflow-pipelines-standalone-v2 | tensorflow | us-central1-a | kubeflow | dbpedia-exp-1-ntq7tfvj-ps-0 | prj-vertex-ai | k8s_container | ERROR | 2022-07-11 06:07:30.811609: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (dbpedia-exp-1-ntq7tfvj-ps-0): /proc/driver/nvidia/version does not exist | 2022-07-11T06:07:30.812519914Z
et3b3w8ji0nlmfc3 | gke-kubeflow-pipelines-s-default-pool-e4e6dda3-544k | kubeflow.org | dbpedia-exp-1-ntq7tfvj | 0 | ps | dbpedia-exp-1-ntq7tfvj | tfjob-controller | 0 | ps | projects/prj-vertex-ai/logs/stderr | 2022-07-11T06:07:35.218143792Z | kubeflow-pipelines-standalone-v2 | tensorflow | us-central1-a | kubeflow | dbpedia-exp-1-ntq7tfvj-ps-0 | prj-vertex-ai | k8s_container | ERROR | 2022-07-11 06:07:30.811541: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303) | 2022-07-11T06:07:30.812511863Z
u8jhqsnsjg3n114l | gke-kubeflow-pipelines-s-default-pool-e4e6dda3-544k | kubeflow.org | dbpedia-exp-1-ntq7tfvj | 0 | ps | dbpedia-exp-1-ntq7tfvj | tfjob-controller | 0 | ps | projects/prj-vertex-ai/logs/stderr | 2022-07-11T06:07:35.218143792Z | kubeflow-pipelines-standalone-v2 | tensorflow | us-central1-a | kubeflow | dbpedia-exp-1-ntq7tfvj-ps-0 | prj-vertex-ai | k8s_container | ERROR | 2022-07-11 06:07:30.811461: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
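The snippet in the question logs only the `stdout` of `nvidia-smi`, so an empty string does not say why nothing was printed. If the binary is missing or fails, the reason usually lands on `stderr` together with a non-zero exit code. Below is a minimal sketch that captures all three, using only the standard library; the helper name `run_diagnostic` is hypothetical, and the Python interpreter stands in for `nvidia-smi` so the example runs on any machine.

```python
import subprocess
import sys
from typing import Dict, List, Union


def run_diagnostic(cmd: List[str]) -> Dict[str, Union[int, str]]:
    """Run a command and return its exit code, stdout and stderr as text."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except FileNotFoundError as exc:
        # Raised when the executable itself cannot be found on PATH.
        return {"returncode": 127, "stdout": "", "stderr": str(exc)}
    return {"returncode": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}


# In the trial container one would call: run_diagnostic(["nvidia-smi"])
ok = run_diagnostic([sys.executable, "-c", "print('ok')"])
missing = run_diagnostic(["definitely-not-a-real-binary-xyz"])
print(ok["returncode"], ok["stdout"].strip())
print(missing["returncode"], missing["stderr"] != "")
```

Logging the `stderr` and `returncode` fields alongside `stdout` makes it easier to tell a missing binary apart from a driver that is installed but not visible to the container.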
What did you expect to happen:
I expected the pipeline stage to run the text classification on the GPU, but it does not.
Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]
Environment:
- Katib version (check the Katib controller image version): v0.13.0
- Kubernetes version (kubectl version): 1.22.8-gke.202
- OS (uname -a): linux/ COS in containers
Issue Analytics
- State:
- Created a year ago
- Comments:8 (4 by maintainers)
Top GitHub Comments
This is not specific to Katib. It means that the trials could not find a node that satisfies these resource requirements to start the pod. One thing to note: when you add resource requirements to the trial spec, every trial pod requests the same set of resources when trials run in parallel. For example, if trialSpec requires 1 GPU and experimentSpec allows 3 parallelTrials, then each trial pod will request 1 GPU (3 GPUs in total).
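The resource request and the arithmetic described in this comment can be sketched as follows. This is only an illustration and not the commenter's gist: `add_gpu_limit` and `total_gpus_requested` are hypothetical helpers, the image name is a placeholder, and the container dict mirrors the `tensorflow` container from the trial spec in the question. `nvidia.com/gpu` is the resource name Kubernetes uses for NVIDIA GPUs.

```python
import copy
from typing import Any, Dict


def add_gpu_limit(container: Dict[str, Any], gpus: int = 1) -> Dict[str, Any]:
    """Return a copy of a container spec that requests `gpus` NVIDIA GPUs."""
    updated = copy.deepcopy(container)
    limits = updated.setdefault("resources", {}).setdefault("limits", {})
    limits["nvidia.com/gpu"] = gpus
    return updated


def total_gpus_requested(gpus_per_trial: int, parallel_trial_count: int) -> int:
    """GPUs the cluster must offer for all parallel trials to be scheduled at once."""
    return gpus_per_trial * parallel_trial_count


worker_container = {
    "name": "tensorflow",
    "image": "example-image",  # placeholder, not a real image
    "command": ["python", "/opt/trainer/task.py"],
}
gpu_container = add_gpu_limit(worker_container, gpus=1)
print(gpu_container["resources"])   # {'limits': {'nvidia.com/gpu': 1}}
print(total_gpus_requested(1, 3))   # 3
```

A pod whose containers carry no `nvidia.com/gpu` limit can be scheduled on any node, including one without a GPU, which would be consistent with the trial pod in the logs running on the default pool.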
Here is the gist of my working sample. You can ignore the node selector stuff; it just helps to schedule the pod on the GPU node I want (dedicated to training in my case):