
GPU not used by Katib experiment on GKE: Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory


/kind bug

What steps did you take and what happened: I am trying to create a Kubeflow pipeline that tunes the hyperparameters of a text classification model in TensorFlow using Katib on a GKE cluster. I created the cluster using the commands below:

CLUSTER_NAME="kubeflow-pipelines-standalone-v2"
ZONE="us-central1-a"
MACHINE_TYPE="n1-standard-2"
SCOPES="cloud-platform"
NODES_NUM=1

gcloud container clusters create $CLUSTER_NAME --zone $ZONE --machine-type $MACHINE_TYPE --scopes $SCOPES --num-nodes $NODES_NUM

gcloud config set compute/zone $ZONE
gcloud container clusters get-credentials $CLUSTER_NAME

export PIPELINE_VERSION=1.8.2
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=$PIPELINE_VERSION"
# katib
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.13.0"
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.4.0"
kubectl apply -f ./test.yaml

# disabling caching
export NAMESPACE=kubeflow
kubectl get mutatingwebhookconfiguration cache-webhook-${NAMESPACE}
kubectl patch mutatingwebhookconfiguration cache-webhook-${NAMESPACE} --type='json' -p='[{"op":"replace", "path": "/webhooks/0/rules/0/operations/0", "value": "DELETE"}]'

kubectl describe configmap inverse-proxy-config -n kubeflow | grep googleusercontent.com

GPU_POOL_NAME="gpu-pool2"
CLUSTER_NAME="kubeflow-pipelines-standalone-v2"
CLUSTER_ZONE="us-central1-a"
GPU_TYPE="nvidia-tesla-k80"
GPU_COUNT=1
MACHINE_TYPE="n1-highmem-8"
NODES_NUM=1

# Node pool creation may take several minutes.
gcloud container node-pools create ${GPU_POOL_NAME} --accelerator type=${GPU_TYPE},count=${GPU_COUNT} --zone ${CLUSTER_ZONE} --cluster ${CLUSTER_NAME} --machine-type=${MACHINE_TYPE} --scopes=cloud-platform --num-nodes $NODES_NUM
  
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
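
Once the driver DaemonSet has finished, the GPU nodes should advertise nvidia.com/gpu under their allocatable resources. A minimal check of that, assuming the kubernetes Python client and a kubeconfig pointing at the cluster (this is not part of the original steps):

from kubernetes import client, config

# List every node and print how many nvidia.com/gpu devices it advertises.
config.load_kube_config()
for node in client.CoreV1Api().list_node().items:
    gpus = (node.status.allocatable or {}).get("nvidia.com/gpu", "0")
    print(node.metadata.name, "nvidia.com/gpu allocatable:", gpus)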

I then created a Kubeflow pipeline:


from kfp import compiler
import kfp
import kfp.dsl as dsl
from kfp import components

@dsl.pipeline(
    name="End to End Pipeline",
    description="An end to end mnist example including hyperparameter tuning, train and inference"
)
def pipeline_func(
    time_loc = "gs://faris_bucket_us_central/Pipeline_data/input_dataset/dbpedia_model/GKE_Katib/time_csv/",
    hyper_image_uri_train = "gcr.io/.............../hptunekatib:v7",
    hyper_image_uri = "gcr.io/.............../hptunekatibclient:v7",
    model_uri = "gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/",
    experiment_name = "dbpedia-exp-1",
    experiment_namespace = "kubeflow",
    experiment_timeout_minutes = 60
):
    
    # hyperparameter tuning stage: launches the Katib Experiment and returns the best parameters; memory and CPU limits are set below
    hp_tune = dsl.ContainerOp(
          name='hp-tune-katib',
          image=hyper_image_uri,
          arguments=[
            '--experiment_name', experiment_name,
            '--experiment_namespace', experiment_namespace,
            '--experiment_timeout_minutes', experiment_timeout_minutes,
            '--delete_after_done', True,
            '--hyper_image_uri', hyper_image_uri_train,
            '--time_loc', time_loc, 
            '--model_uri', model_uri

          ],
          file_outputs={'best-params': '/output.txt'}
        ).set_gpu_limit(1)
    
    # restricting the maximum usable memory and CPU for this stage
    hp_tune.set_memory_limit("49G")
    hp_tune.set_cpu_limit("7")

# Run the Kubeflow Pipeline in the user's namespace.
if __name__ == '__main__':
    
    # compiling the pipeline and generating a tar.gz file to upload to the Kubeflow Pipelines UI
    import kfp.compiler as compiler

    compiler.Compiler().compile(
        pipeline_func, 'pipeline_db.tar.gz'
    )
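
Worth noting: set_gpu_limit(1) only attaches an nvidia.com/gpu limit to the pod that runs this launcher container; it does not propagate to the TFJob trial pods that Katib creates later (see the comments at the end of this page). For reference, a roughly equivalent explicit call, assuming the kfp v1 SDK used here:

hp_tune.container.add_resource_limit("nvidia.com/gpu", "1")  # applies to the launcher pod only, not to Katib trials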

These are my two containers:

  1. Launches the Katib Experiment based on the parameters and arguments passed to dsl.ContainerOp().
  2. The main training script for text classification. This container is passed as the "image" in the Katib trial spec.

gcr.io/…/hptunekatibclient:v7

# importing required packages
import argparse
import datetime
from datetime import datetime as dt
from distutils.util import strtobool
import json
import os
import logging
import time
import pandas as pd
from google.cloud import storage
from pytz import timezone

from kubernetes.client import V1ObjectMeta

from kubeflow.katib import KatibClient
from kubeflow.katib import ApiClient
from kubeflow.katib import V1beta1Experiment

from kubeflow.katib import V1beta1ExperimentSpec
from kubeflow.katib import V1beta1AlgorithmSpec
from kubeflow.katib import V1beta1ObjectiveSpec
from kubeflow.katib import V1beta1ParameterSpec
from kubeflow.katib import V1beta1FeasibleSpace
from kubeflow.katib import V1beta1TrialTemplate
from kubeflow.katib import V1beta1TrialParameterSpec
from kubeflow.katib import V1beta1MetricsCollectorSpec
from kubeflow.katib import V1beta1CollectorSpec
from kubeflow.katib import V1beta1FileSystemPath
from kubeflow.katib import V1beta1SourceSpec
from kubeflow.katib import V1beta1FilterSpec

logger = logging.getLogger()
logging.basicConfig(level=logging.INFO)

FINISH_CONDITIONS = ["Succeeded", "Failed"]


# function to record the start time and end time to calculate execution time, pipeline start up and teardown time
def write_time(types, time_loc):

    formats = "%Y-%m-%d %I:%M:%S %p"

    now_utc = dt.now(timezone('UTC'))
    now_asia = now_utc.astimezone(timezone('Asia/Kolkata'))
    start_time = str(now_asia.strftime(formats))
    time_df = pd.DataFrame({"time":[start_time]})
    print("written")
    time_df.to_csv(time_loc + types + ".csv", index=False)


def get_args():
    parser = argparse.ArgumentParser(description='Katib Experiment launcher')
    parser.add_argument('--experiment_name', type=str,
                        help='Experiment name')
    parser.add_argument('--experiment_namespace', type=str, default='anonymous',
                        help='Experiment namespace')
    parser.add_argument('--experiment_timeout_minutes', type=int, default=60*24,
                        help='Time in minutes to wait for the Experiment to complete')
    parser.add_argument('--delete_after_done', type=strtobool, default=True,
                        help='Whether to delete the Experiment after it is finished')
    parser.add_argument('--hyper_image_uri', type=str, default="gcr.io/.............../hptunekatib:v2",
                        help='Hyper image uri')
    parser.add_argument('--time_loc', type=str, default="gs://faris_bucket_us_central/Pipeline_data/input_dataset/dbpedia_model/GKE_Katib/time_csv/",
                        help='Time loc')
    parser.add_argument('--model_uri', type=str, default="gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/",
                        help='Model URI')
    
    return parser.parse_args()

def wait_experiment_finish(katib_client, experiment, timeout):
    polling_interval = datetime.timedelta(seconds=30)
    end_time = datetime.datetime.now() + datetime.timedelta(minutes=timeout)
    experiment_name = experiment.metadata.name
    experiment_namespace = experiment.metadata.namespace
    while True:
        current_status = None
        try:
            current_status = katib_client.get_experiment_status(name=experiment_name, namespace=experiment_namespace)
        except Exception as e:
            logger.info("Unable to get current status for the Experiment: {} in namespace: {}. Exception: {}".format(
                experiment_name, experiment_namespace, e))
        # If Experiment has reached complete condition, exit the loop.
        if current_status in FINISH_CONDITIONS:
            logger.info("Experiment: {} in namespace: {} has reached the end condition: {}".format(
                experiment_name, experiment_namespace, current_status))
            return
        # Print the current condition.
        logger.info("Current condition for Experiment: {} in namespace: {} is: {}".format(
            experiment_name, experiment_namespace, current_status))
        # If the timeout has been reached, raise an exception.
        if datetime.datetime.now() > end_time:
            raise Exception("Timeout waiting for Experiment: {} in namespace: {} "
                            "to reach one of these conditions: {}".format(
                                experiment_name, experiment_namespace, FINISH_CONDITIONS))
        # Sleep for poll interval.
        time.sleep(polling_interval.seconds)


if __name__ == "__main__":
    

    args = get_args()
    
    write_time("hyper_parameter_tuning_start", args.time_loc)
    
    # Trial count specification.
    max_trial_count = 2
    max_failed_trial_count = 2
    parallel_trial_count = 1

    # Objective specification.
    objective = V1beta1ObjectiveSpec(
        type="minimize",
        # goal=100,
        objective_metric_name="accuracy"
        # additional_metric_names=["accuracy"]
    )

    # Metrics collector specification (left commented out below).
#     metrics_collector_specs = V1beta1MetricsCollectorSpec(
#         collector=V1beta1CollectorSpec(kind="File"),
#         source=V1beta1SourceSpec(
#             file_system_path=V1beta1FileSystemPath(
#                 # format="TEXT",
#                 path="/opt/trainer/katib/metrics.log",
#                 kind="File"
#             ),
#             filter=V1beta1FilterSpec(
#                 # metrics_format=["{metricName: ([\\w|-]+), metricValue: ((-?\\d+)(\\.\\d+)?)}"]
#                 metrics_format=["([\w|-]+)\s*=\s*([+-]?\d*(\.\d+)?([Ee][+-]?\d+)?)"]

#             )
#         )
#     )

    # Algorithm specification.
    algorithm = V1beta1AlgorithmSpec(
        algorithm_name="random",
    )

    # Experiment search space.
    # In this example we tune learning rate and batch size.
    parameters = [
        V1beta1ParameterSpec(
            name="batch_size",
            parameter_type="discrete",
            feasible_space=V1beta1FeasibleSpace(
                list=["32", "42", "52", "62", "64"]
            ),
        ),
        V1beta1ParameterSpec(
            name="learning_rate",
            parameter_type="double",
            feasible_space=V1beta1FeasibleSpace(
                min="0.001",
                max="0.005"
            ),
        )
    ]

    # TODO (andreyvelich): Use community image for the mnist example.
    trial_spec = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "spec": {
            "tfReplicaSpecs": {
                "PS": {
                    "replicas": 1,
                    "restartPolicy": "Never",
                    "template": {
                        "metadata": {
                            "annotations": {
                                "sidecar.istio.io/inject": "false",
                            }
                        },
                        "spec": {
                            "containers": [
                                {
                                    "name": "tensorflow",
                                    "image": args.hyper_image_uri,
                                    "command": [
                                        "python",
                                        "/opt/trainer/task.py",
                                        "--model_uri=" + args.model_uri,
                                        "--batch_size=${trialParameters.batchSize}",
                                        "--learning_rate=${trialParameters.learningRate}"

                                    ],
                                    "ports" : [
                                        {
                                            "containerPort": 2222,
                                            "name" : "tfjob-port"
                                        }
                                    ]
                                    # "resources": {
                                    #     "limits" : {
                                    #         "cpu": "1"
                                    #     }
                                    # }
                                }
                            ]
                        }
                    }
                },
                "Worker": {
                    "replicas": 1,
                    "restartPolicy": "Never",
                    "template": {
                        "metadata": {
                            "annotations": {
                                "sidecar.istio.io/inject": "false",
                            }
                        },
                        "spec": {
                            "containers": [
                                {
                                    "name": "tensorflow",
                                    "image": args.hyper_image_uri,
                                    "command": [
                                        "python",
                                        "/opt/trainer/task.py",
                                        "--model_uri=" + args.model_uri,
                                        "--batch_size=${trialParameters.batchSize}",
                                        "--learning_rate=${trialParameters.learningRate}"
                                    ],
                                    "ports" : [
                                        {
                                            "containerPort": 2222,
                                            "name" : "tfjob-port"
                                        }
                                    ]
                                    # "resources": {
                                    #     "limits" : {
                                    #         "nvidia.com/gpu": 1
                                    #     }
                                    # }
                                }
                            ]
                        }
                    }
                }
            }
        }
    }


    # Configure parameters for the Trial template.
    trial_template = V1beta1TrialTemplate(
        primary_container_name="tensorflow",
        trial_parameters=[
            V1beta1TrialParameterSpec(
                name="batchSize",
                description="batch size",
                reference="batch_size"
            ),
            V1beta1TrialParameterSpec(
                name="learningRate",
                description="Learning rate",
                reference="learning_rate"
            ),
        ],
        trial_spec=trial_spec
    )

    # Create an Experiment from the above parameters.
    experiment_spec = V1beta1ExperimentSpec(
        max_trial_count=max_trial_count,
        max_failed_trial_count=max_failed_trial_count,
        parallel_trial_count=parallel_trial_count,
        objective=objective,
        algorithm=algorithm,
        parameters=parameters,
        trial_template=trial_template
    )

    experiment_name = args.experiment_name
    experiment_namespace = args.experiment_namespace

    logger.info("Creating Experiment: {} in namespace: {}".format(experiment_name, experiment_namespace))

    # Create Experiment object.
    experiment = V1beta1Experiment(
        api_version="kubeflow.org/v1beta1",
        kind="Experiment",
        metadata=V1ObjectMeta(
            name=experiment_name,
            namespace=experiment_namespace
        ),
        spec=experiment_spec
    )
    logger.info("Experiment Spec : " + str(experiment_spec))
    
    
    logger.info("Experiment: " + str(experiment))

    # Create Katib client.
    katib_client = KatibClient()
    # Create Experiment in Kubernetes cluster.
    output = katib_client.create_experiment(experiment, namespace=experiment_namespace)

    # Wait until Experiment is created.
    end_time = datetime.datetime.now() + datetime.timedelta(minutes=60)
    while True:
        current_status = None
        # Try to get Experiment status.
        try:
            current_status = katib_client.get_experiment_status(name=experiment_name, namespace=experiment_namespace)
        except Exception:
            logger.info("Waiting until Experiment is created...")
        # If current status is set, exit the loop.
        if current_status is not None:
            break
        # If the timeout has been reached, raise an exception.
        if datetime.datetime.now() > end_time:
            raise Exception("Timeout waiting for Experiment: {} in namespace: {} to be created".format(
                experiment_name, experiment_namespace))
        time.sleep(1)

    logger.info("Experiment is created")

    # Wait for Experiment finish.
    wait_experiment_finish(katib_client, experiment, args.experiment_timeout_minutes)

    # Check if Experiment is successful.
    if katib_client.is_experiment_succeeded(name=experiment_name, namespace=experiment_namespace):
        logger.info("Experiment: {} in namespace: {} is successful".format(
            experiment_name, experiment_namespace))

        optimal_hp = katib_client.get_optimal_hyperparameters(
            name=experiment_name, namespace=experiment_namespace)
        logger.info("Optimal hyperparameters:\n{}".format(optimal_hp))

        # # Create dir if it doesn't exist.
        # if not os.path.exists(os.path.dirname("output.txt")):
        #     os.makedirs(os.path.dirname("output.txt"))
        # Save HyperParameters to the file.
        with open("output.txt", 'w') as f:
            f.write(json.dumps(optimal_hp))
    else:
        logger.info("Experiment: {} in namespace: {} is failed".format(
            experiment_name, experiment_namespace))
        # Print Experiment if it is failed.
        experiment = katib_client.get_experiment(name=experiment_name, namespace=experiment_namespace)
        logger.info(experiment)

    # Delete Experiment if it is needed.
    if args.delete_after_done:
        katib_client.delete_experiment(name=experiment_name, namespace=experiment_namespace)
        logger.info("Experiment: {} in namespace: {} has been deleted".format(
            experiment_name, experiment_namespace))
        
    write_time("hyper_parameter_tuning_end", args.time_loc)

Dockerfile

FROM gcr.io/deeplearning-platform-release/tf-gpu.2-8

# installing packages
RUN pip install pandas
RUN pip install gcsfs
RUN pip install google-cloud-storage
RUN pip install pytz
RUN pip install kubernetes
RUN pip install kubeflow-katib
# copy the Katib launcher code into the image

RUN mkdir /hp_tune
COPY task.py /hp_tune

# CREDENTIAL Authentication
COPY /prj-vertex-ai-2c390f7e8fec.json /hp_tune/prj-vertex-ai-2c390f7e8fec.json
ENV GOOGLE_APPLICATION_CREDENTIALS="/hp_tune/prj-vertex-ai-2c390f7e8fec.json"

# entry point
ENTRYPOINT ["python3", "/hp_tune/task.py"]

gcr.io/…/hptunekatib:v7

# import os
# os.system("pip install tensorflow-gpu==2.8.0")

from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import os
from tensorflow.keras.layers import Conv1D, MaxPool1D ,Embedding ,concatenate
from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense,Input 
from tensorflow.keras.models import Model 
from tensorflow import keras
from datetime import datetime
from pytz import timezone
from sklearn.model_selection import train_test_split
import pandas as pd
from google.cloud import storage
import argparse
import logging

logger = logging.getLogger()
logging.basicConfig(level=logging.INFO)    

logger.info("Num GPUs Available: " + str(tf.config.list_physical_devices('GPU')))
import subprocess
process = subprocess.Popen(['sh', '-c', 'nvidia-smi'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = process.communicate()
logger.info("NVIDIA SMI " + str(out))
def format_strs(x):
    strs = ""
    if x > 0:
        sign_t = "+"
        strs += "+"
    else:
        sign_t = "-"
        
        strs += "-"
        
    strs = strs + "{:.1e}".format(x)
    
    if "+" in strs[1:]:
        sign = "+"
        strs = strs[1:].split("+")
    else:
        sign = "-"
        strs = strs[1:].split("-")
        
    last_d = strs[1][1:] if strs[1][0] == "0" else strs[1]
    
    strs_f = sign_t + strs[0] + sign + last_d
    return strs_f
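
# Illustrative note (not part of the original script): format_strs rewrites
# Python's default scientific notation into a compact form, e.g.
# format_strs(0.993) -> "+9.9e-1". That is the value behind the
# "accuracy = +9.9e-1" metric line that Katib's stdout metrics collector parses.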
    
def get_args():
    '''Parses args. Must include all hyperparameters you want to tune.'''

    parser = argparse.ArgumentParser()
    
    parser.add_argument(
      '--learning_rate',
      required=True,
      type=float,
      help='learning_rate')
    
    parser.add_argument(
      '--batch_size',
      required=True,
      type=int,
      help='batch_size')
    
    parser.add_argument(
      '--model_uri',
      required=True,
      type=str,
      help='Model Uri')
    
    args = parser.parse_args()
    return args

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"

    # The ID of your GCS object
    # source_blob_name = "storage-object-name"

    # The path to which the file should be downloaded
    # destination_file_name = "local/path/to/file"

    storage_client = storage.Client()

    bucket = storage_client.bucket(bucket_name)

    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)


def create_dataset():
    
    download_blob("faris_bucket_us_central", "Pipeline_data/input_dataset/dbpedia_model/data/" + "train.csv", "train.csv")
    
    trainData = pd.read_csv('train.csv')
    trainData.columns = ['label','title','description']
    
    # trainData = trainData.sample(frac=0.002)
    
    X_train, X_test, y_train, y_test = train_test_split(trainData['description'], trainData['label'], stratify=trainData['label'], test_size=0.1, random_state=0)
    
    return X_train, X_test, y_train, y_test


def train_model(train_X, train_y, test_X, test_y, learning_rate, batch_size):
  
    logger.info("Training with lr = " + str(learning_rate) + "bs = " + str(batch_size))
    bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
    bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/2", trainable=False)

    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessed_text = bert_preprocess(text_input)
    outputs = bert_encoder(preprocessed_text)

    # Neural network layers
    l = tf.keras.layers.Dropout(0.2, name="dropout")(outputs['pooled_output']) # dropout_rate
    l = tf.keras.layers.Dense(14,activation='softmax',kernel_initializer=tf.keras.initializers.GlorotNormal(seed=24))(l) # dense_units

    model = Model(inputs=[text_input], outputs=l)

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),loss='categorical_crossentropy',metrics=['accuracy'])
    
    history = model.fit(train_X, train_y, epochs=5, validation_data=(test_X, test_y), batch_size=batch_size)
    
    return model, history


def main():
    
    args = get_args()
    logger.info("Creating dataset")
    train_X, test_X, train_y, test_y = create_dataset()
    
    # one_hot_encoding the class label
    encoder = LabelEncoder()
    encoder.fit(train_y)
    y_train_encoded = encoder.transform(train_y)
    y_test_encoded = encoder.transform(test_y)

    y_train_ohe = tf.keras.utils.to_categorical(y_train_encoded)
    y_test_ohe = tf.keras.utils.to_categorical(y_test_encoded)
    
    logger.info("Training model")
    model = train_model(
        train_X,
        y_train_ohe,
        test_X,
        y_test_ohe,
        args.learning_rate,
        int(float(args.batch_size))
    )
    
    logger.info("Saving model")
    artifact_filename = 'saved_model'
    local_path = artifact_filename
    tf.saved_model.save(model[0], local_path)
    
    # Upload model artifact to Cloud Storage
    model_directory = args.model_uri + "-".join(os.environ["HOSTNAME"].split("-")[:-2]) + "/"
    local_path = "saved_model/assets/vocab.txt"
    storage_path = os.path.join(model_directory, "assets/vocab.txt")
    blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
    blob.upload_from_filename(local_path)
    
    local_path = "saved_model/variables/variables.data-00000-of-00001"
    storage_path = os.path.join(model_directory, "variables/variables.data-00000-of-00001")
    blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
    blob.upload_from_filename(local_path)
    
    local_path = "saved_model/variables/variables.index"
    storage_path = os.path.join(model_directory, "variables/variables.index")
    blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
    blob.upload_from_filename(local_path)
    
    local_path = "saved_model/saved_model.pb"
    storage_path = os.path.join(model_directory, "saved_model.pb")
    blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
    blob.upload_from_filename(local_path)

    logger.info("Model Saved at " + model_directory)
    
    logger.info("Keras Score: " + str(model[1].history["accuracy"][-1]))
    
    hp_metric = model[1].history["accuracy"][-1]
    
    print("accuracy =", format_strs(hp_metric))

if __name__ == "__main__":
    main()

Dockerfile

# FROM gcr.io/deeplearning-platform-release/tf-cpu.2-8
FROM gcr.io/deeplearning-platform-release/tf-gpu.2-8

RUN mkdir -p /opt/trainer

# RUN pip install scikit-learn
RUN pip install tensorflow_text==2.8.1
# RUN pip install tensorflow-gpu==2.8.0

# CREDENTIAL Authentication
COPY /prj-vertex-ai-2c390f7e8fec.json /prj-vertex-ai-2c390f7e8fec.json
ENV GOOGLE_APPLICATION_CREDENTIALS="/prj-vertex-ai-2c390f7e8fec.json"

COPY *.py /opt/trainer/

# # RUN chgrp -R 0 /opt/trainer && chmod -R g+rwX /opt/trainer
# RUN chmod -R 777 /home/trainer

ENTRYPOINT ["python", "/opt/trainer/task.py"]

# Sets up the entry point to invoke the trainer.
# ENTRYPOINT ["python", "-m", "trainer.task"]


The pipeline runs, but it does not use the GPU, and this piece of code

logger.info("Num GPUs Available: " + str(tf.config.list_physical_devices('GPU')))
import subprocess
process = subprocess.Popen(['sh', '-c', 'nvidia-smi'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = process.communicate()
logger.info("NVIDIA SMI " + str(out))

gives an empty list and an empty string. It is as if the GPU does not exist. I am attaching the container logs below (a quick diagnostic sketch follows the logs).

Container logs for trial pod dbpedia-exp-1-ntq7tfvj-ps-0 (TFJob replica type: ps, controller: tfjob-controller, namespace: kubeflow, cluster: kubeflow-pipelines-standalone-v2, zone: us-central1-a, node: gke-kubeflow-pipelines-s-default-pool-e4e6dda3-544k, project: prj-vertex-ai):

2022-07-11T06:07:30Z  INFO   accuracy = +9.9e-1
2022-07-11T06:07:30Z  ERROR  INFO:root:Num GPUs Available: []
2022-07-11T06:07:30Z  ERROR  2022-07-11 06:07:30.811609: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (dbpedia-exp-1-ntq7tfvj-ps-0): /proc/driver/nvidia/version does not exist
2022-07-11T06:07:30Z  ERROR  2022-07-11 06:07:30.811541: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-07-11T06:07:30Z  ERROR  2022-07-11 06:07:30.811461: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load ... (truncated)
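
The node name in these logs belongs to the original default pool rather than gpu-pool2, which is consistent with the trial pod never requesting a GPU. One way to confirm where a trial pod was scheduled and what it asked for is to read its spec back; a minimal sketch, assuming the kubernetes Python client and a kubeconfig for this cluster (the pod name is taken from the logs above):

from kubernetes import client, config

# Inspect the trial pod: which node it landed on and whether it requested a GPU.
config.load_kube_config()
pod = client.CoreV1Api().read_namespaced_pod(
    name="dbpedia-exp-1-ntq7tfvj-ps-0", namespace="kubeflow")
print("scheduled on node:", pod.spec.node_name)
for c in pod.spec.containers:
    print(c.name, "resource limits:", c.resources.limits)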

What did you expect to happen:

I expected the pipeline stage to use the GPU and run the text classification on it, but it does not.

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

  • Katib version (check the Katib controller image version): v0.13.0
  • Kubernetes version: (kubectl version): 1.22.8-gke.202
  • OS (uname -a): linux/ COS in containers

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

2 reactions
johnugeorge commented, Jul 15, 2022

This is not specific to Katib. It means that the trials could not find a node that satisfies these resource requirements to start the pod. One thing to note: when you add resource requirements to the trial spec, every trial pod will try to request the same set of resources when run in parallel. E.g. if the trialSpec has a 1 GPU requirement and the experimentSpec allows 3 parallelTrials, then each trial pod will request 1 GPU (a total of 3 GPUs).
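
Concretely for the trial spec posted above, this means the GPU has to be requested on the trial container itself; the resources block is commented out in the Worker spec of the TFJob. A minimal sketch of the Worker replica with the GPU limit in place; the image, paths, and the one-GPU-per-trial choice are assumptions for illustration, not taken from a verified working config:

worker_spec = {
    "replicas": 1,
    "restartPolicy": "Never",
    "template": {
        "metadata": {"annotations": {"sidecar.istio.io/inject": "false"}},
        "spec": {
            "containers": [
                {
                    "name": "tensorflow",
                    "image": "gcr.io/<project>/hptunekatib:v7",  # placeholder image
                    "command": [
                        "python",
                        "/opt/trainer/task.py",
                        "--model_uri=gs://<bucket>/models/",  # placeholder
                        "--batch_size=${trialParameters.batchSize}",
                        "--learning_rate=${trialParameters.learningRate}",
                    ],
                    # Requesting the GPU here is what gets the trial pod scheduled
                    # onto the GPU node pool; set_gpu_limit() on the launcher
                    # ContainerOp does not carry over to trial pods.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ]
        },
    },
}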

0 reactions
AlexandreBrown commented, Aug 8, 2022

Here is the gist of my working sample. You can ignore the node selector stuff; it just helps schedule the pod on the GPU node I want (dedicated for training in my case):

trial_spec={
        "apiVersion": "batch/v1",
        "kind": "Job",
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                         "sidecar.istio.io/inject": "false"
                    }
                },
                "spec": {
                    "affinity": {
                        "nodeAffinity": {
                            "requiredDuringSchedulingIgnoredDuringExecution": {
                                "nodeSelectorTerms": [
                                    {
                                        "matchExpressions": [
                                            {
                                                "key": "k8s.amazonaws.com/accelerator",
                                                "operator": "In",
                                                "values": [
                                                    "nvidia-tesla-v100"
                                                ]
                                            },
                                            {
                                                "key": "ai-gpu-2",
                                                "operator": "In",
                                                "values": [
                                                    "true"
                                                ]
                                            }
                                        ]
                                    }
                                ]
                            }

                        }
                    },
                    "containers": [
                        {
                            "resources" : {
                                "limits" : {
                                    "nvidia.com/gpu" : 1
                                }
                            },
                            "name": training_container_name,
                            "image": "xxxxxxxxxxxxxxxxxxxxx__YOUR_IMAGE_HERE_xxxxxxxxxxxxxx",
                            "imagePullPolicy": "Always",
                            "command": train_params + [
                                "--learning_rate=${trialParameters.learning_rate}",
                                "--optimizer=${trialParameters.optimizer}",
                                "--batch_size=${trialParameters.batch_size}",
                                "--max_epochs=${trialParameters.max_epochs}"
                            ]
                        }
                    ],
                    "restartPolicy": "Never",
                    "serviceAccountName": "default-editor"
                }
            }
        }
    }

