Unable to scale to large number of concurrent Katib Experiments / Trials

See original GitHub issue

/kind bug

What steps did you take and what happened:

We launched 120 Katib Experiments simultaneously, each configured with Parallel Trial Count: 30. While monitoring the cluster via kubectl -n kubeflow get pods | grep -v Completed | wc -l, we saw no more than 160 pods running concurrently.
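One caveat about the count above: `grep -v Completed | wc -l` also counts the header line that `kubectl get pods` prints, plus any pods that are Pending or belong to other components in the namespace. The following is a minimal sketch, assuming the default five-column output of `kubectl get pods`, that counts only pods whose STATUS is exactly `Running`. The pod names in the sample are made up for illustration.

```python
def count_running_pods(kubectl_output: str) -> int:
    """Count lines whose third column (STATUS) is exactly 'Running'.

    Assumes the default `kubectl get pods` layout:
    NAME READY STATUS RESTARTS AGE
    """
    count = 0
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "Running":
            count += 1
    return count


# Hypothetical sample output; names and ages are invented.
SAMPLE = """\
NAME                      READY   STATUS      RESTARTS   AGE
katib-controller-abc      1/1     Running     0          2d
trial-a-xyz               1/1     Running     0          10s
trial-b-xyz               0/1     Pending     0          3s
trial-c-xyz               0/1     Completed   0          45s
"""

print(count_running_pods(SAMPLE))  # 2
```

On this sample the original pipeline would report 4 (the header, two Running pods, and one Pending pod), so the 160 figure is best read as an upper bound on concurrently running trial pods.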

What did you expect to happen:

We expected 120 * 30 = 3600 pods to be running concurrently, or as close to that as our cluster resources would allow.

Anything else you would like to add:

This was run using the v1alpha3 version of Katib.
Each individual trial is very short-running (approximately 30 seconds).
The pod running the katib-controller did not exceed 25% CPU utilization, so it did not appear to be overloaded while reconciling all of the Trials.
It took a long time for the Trial objects to reach the Created state.
It took a long time for Created trials to launch a Job object.
Once the Job object was created, the pod was created shortly thereafter and the task ran successfully.
All Katib Trials / Experiments did eventually complete successfully.
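The observations above can be turned into a rough throughput estimate using Little's law (steady-state concurrency = start rate × time in system). This is a back-of-the-envelope sketch, assuming each pod lives for about the 30-second trial duration (ignoring scheduling and image-pull time) and that all of the roughly 160 observed pods were trial pods.

```python
def start_rate(concurrent_pods: float, pod_lifetime_s: float) -> float:
    """Little's law rearranged: pods started per second at steady state."""
    return concurrent_pods / pod_lifetime_s


POD_LIFETIME_S = 30.0        # approximate trial duration from the report
OBSERVED_CONCURRENCY = 160   # upper bound seen while monitoring
EXPECTED_CONCURRENCY = 120 * 30  # experiments * parallel trial count

observed = start_rate(OBSERVED_CONCURRENCY, POD_LIFETIME_S)
required = start_rate(EXPECTED_CONCURRENCY, POD_LIFETIME_S)

print(f"implied start rate:  {observed:.1f} pods/s")   # 5.3 pods/s
print(f"required start rate: {required:.1f} pods/s")   # 120.0 pods/s
print(f"shortfall factor:    {required / observed:.1f}x")  # 22.5x
```

Under these assumptions, sustaining 3600 concurrent 30-second pods would need about 120 pod starts per second, while the observed plateau corresponds to roughly 5 per second. With trials this short, the plateau reflects how fast Trials and Jobs are being created rather than how many pods the cluster can hold.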

Both of those slow steps are handled by ReconcileTrial.Reconcile, which makes me think the delay is related to the trial controller not receiving Trials to reconcile at a fast enough cadence. Do you agree with that assessment, and do you have any ideas on how to troubleshoot?
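One way to start troubleshooting is to quantify the delay between a Trial being created and its Job being created, using the `metadata.creationTimestamp` field that every Kubernetes object carries. The sketch below assumes the timestamp pairs have already been collected (for example from `kubectl get ... -o json`); the sample values are invented for illustration and are not measurements from this cluster.

```python
from datetime import datetime
from statistics import median


def parse_k8s_time(ts: str) -> datetime:
    """Parse an RFC 3339 timestamp such as '2020-09-05T10:00:00Z'."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def creation_lags(pairs):
    """Seconds between Trial creation and Job creation for each pair."""
    return [
        (parse_k8s_time(job_ts) - parse_k8s_time(trial_ts)).total_seconds()
        for trial_ts, job_ts in pairs
    ]


# Hypothetical (trial creationTimestamp, job creationTimestamp) pairs.
SAMPLE_PAIRS = [
    ("2020-09-05T10:00:00Z", "2020-09-05T10:00:05Z"),
    ("2020-09-05T10:00:00Z", "2020-09-05T10:01:10Z"),
    ("2020-09-05T10:00:01Z", "2020-09-05T10:03:01Z"),
]

lags = creation_lags(SAMPLE_PAIRS)
print(lags)          # [5.0, 70.0, 180.0]
print(median(lags))  # 70.0
```

If the lag grows steadily for Trials created at about the same time, that would be consistent with a backlog in the controller's work queue rather than a fixed per-Trial cost.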

Thank you for your help!

Environment:

Kubeflow version (kfctl version):
Minikube version (minikube version):
Kubernetes version: (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.12-gke.2", GitCommit:"fb7add51f767aae42655d39972210dc1c5dbd4b3", GitTreeState:"clean", BuildDate:"2020-06-01T22:20:10Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}

OS (e.g. from /etc/os-release):

Issue Analytics

State:
Created 3 years ago
Comments: 9 (2 by maintainers)

Top GitHub Comments


issue-label-bot[bot] commented, Sep 5, 2020

Issue-Label Bot is automatically applying the labels:

Label	Probability
area/katib	0.92

Please mark this comment with 👍 or 👎 to give our bot feedback! Links: app homepage, dashboard and code for this bot.


