GPU Kubeflow cluster timeline and advice

I’ve been having some issues enabling GPUs on the Kubeflow cluster I recently set up.

Per this discussion, it seems that microk8s enable gpu works best on microk8s v1.22 for people who already have the nvidia-container-runtime installed on their system. However, as is well known by now, the Kubeflow add-on is only supported up to microk8s v1.21. I’ve tried both:

  1. Going through the steps to enable GPUs with microk8s v1.21. The logs show the operator still installing its own nvidia-container-runtime, even though I passed --set driver.enabled=false to helm3 install.

  2. Using juju and charmed operators to bootstrap a Kubeflow cluster in microk8s v1.22, which produces the same seldon error as reported in #2496.

What should I do? Uninstall nvidia-container-runtime on my host and cross my fingers that microk8s enable gpu works in that case? If there’s any way I can contribute to getting Kubeflow running in microk8s v1.22, I’m willing to chip in and help. Any guidance on solving this problem would be greatly appreciated.

inspection-report-20211025_173607.tar.gz

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

2 reactions
ktsakalozos commented, Oct 28, 2021

Hi @odellus, I’d suggest using v1.20: GPU support on 1.21 is not in a good state, and 1.22 does not have Kubeflow.
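
The version constraints in this reply can be written down as a small lookup. This is only a sketch of what the thread reports (as of late 2021), not an official compatibility matrix. The names REPORTED_SUPPORT, candidate_tracks, and setup_commands are hypothetical, and the commands are built as lists rather than executed.

```python
# Constraints as reported in this thread (late 2021). Assumption: this is a
# summary of the discussion, not an official microk8s compatibility matrix.
REPORTED_SUPPORT = {
    "1.20": {"kubeflow_addon": True, "gpu_usable": True},
    "1.21": {"kubeflow_addon": True, "gpu_usable": False},  # GPU support "not in a good state"
    "1.22": {"kubeflow_addon": False, "gpu_usable": True},  # no kubeflow add-on
}


def candidate_tracks(need_kubeflow, need_gpu):
    """Return the microk8s tracks that satisfy both requirements."""
    return [
        track
        for track, support in sorted(REPORTED_SUPPORT.items())
        if (support["kubeflow_addon"] or not need_kubeflow)
        and (support["gpu_usable"] or not need_gpu)
    ]


def setup_commands(track):
    """Build (but do not run) the commands for a fresh install on one track."""
    return [
        ["sudo", "snap", "install", "microk8s", "--classic", f"--channel={track}/stable"],
        ["microk8s", "enable", "gpu"],
        ["microk8s", "enable", "kubeflow"],
    ]


if __name__ == "__main__":
    for track in candidate_tracks(need_kubeflow=True, need_gpu=True):
        for command in setup_commands(track):
            print(" ".join(command))
```

With both requirements set, only the 1.20 track remains, which matches the suggestion above.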

0 reactions
odellus commented, Dec 3, 2021

The trick to enabling GPUs on microk8s v1.20 was to install the CUDA drivers with the local .run script instead of installing the .deb files with dpkg. Closing.
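
Once the drivers and the gpu add-on are in place, one way to confirm that the cluster actually sees the GPU is to look for the nvidia.com/gpu resource in each node's allocatable resources. A minimal sketch, assuming the NVIDIA device plugin advertises that resource name (its usual behaviour) and that the JSON comes from microk8s kubectl get nodes -o json. The function name and the sample data are made up for illustration.

```python
import json


def allocatable_gpus(nodes_json):
    """Map each node name to the number of nvidia.com/gpu devices it advertises."""
    nodes = json.loads(nodes_json)["items"]
    return {
        node["metadata"]["name"]: int(
            node["status"].get("allocatable", {}).get("nvidia.com/gpu", "0")
        )
        for node in nodes
    }


# Hypothetical, heavily trimmed output of: microk8s kubectl get nodes -o json
SAMPLE = json.dumps(
    {
        "items": [
            {
                "metadata": {"name": "gpu-node"},
                "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}},
            },
            {
                "metadata": {"name": "cpu-node"},
                "status": {"allocatable": {"cpu": "4"}},
            },
        ]
    }
)

if __name__ == "__main__":
    print(allocatable_gpus(SAMPLE))
```

A node that reports 0 here has no GPU visible to the scheduler, which would point back at the driver or runtime installation.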
