
Unable to scale to large number of concurrent Katib Experiments / Trials

See original GitHub issue

/kind bug

What steps did you take and what happened:

We launched 120 Katib Experiments simultaneously, each configured with Parallel Trial Count: 30. While monitoring the cluster via kubectl -n kubeflow get pods | grep -v Completed | wc -l, we saw no more than 160 pods running concurrently.
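For anyone trying to reproduce the same load shape, here is a minimal sketch (not our actual tooling) that creates 120 Experiments with parallelTrialCount: 30 via client-go's dynamic client. The names and namespace are placeholders, the objective/algorithm/parameters/trialTemplate sections are omitted (a real Experiment needs them to be accepted), and it assumes a recent client-go where the dynamic client takes a context:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	dyn, err := dynamic.NewForConfig(ctrl.GetConfigOrDie())
	if err != nil {
		panic(err)
	}

	// Katib v1alpha3 Experiments are served under the kubeflow.org API group.
	gvr := schema.GroupVersionResource{Group: "kubeflow.org", Version: "v1alpha3", Resource: "experiments"}

	for i := 0; i < 120; i++ {
		exp := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "kubeflow.org/v1alpha3",
			"kind":       "Experiment",
			"metadata": map[string]interface{}{
				"name":      fmt.Sprintf("load-test-%03d", i), // hypothetical name prefix
				"namespace": "kubeflow",
			},
			"spec": map[string]interface{}{
				"parallelTrialCount": int64(30),
				"maxTrialCount":      int64(30),
				// objective, algorithm, parameters and trialTemplate omitted for brevity;
				// fill these in before running against a real cluster.
			},
		}}
		if _, err := dyn.Resource(gvr).Namespace("kubeflow").Create(context.TODO(), exp, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```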

What did you expect to happen:

We expected 120 * 30 = 3600 pods to be running concurrently, or as close to that as our cluster resources would allow.

Anything else you would like to add:

  • This was run using the v1alpha3 version of Katib

  • Each individual trial is very short-running (approximately 30 seconds)

  • The pod running the katib-controller never exceeded 25% CPU utilization, so it did not appear to be CPU-starved while reconciling all of the Trials.

  • It took a long time for the Trial objects to reach Created state

  • It took a long time for Created trials to launch a Job object

  • Once the Job object was created, the pod was created shortly thereafter and the task ran successfully.

  • All Katib Trials / Experiments did eventually complete successfully.

Since both of those slow steps are handled by ReconcileTrial.Reconcile, it makes me think the delay is related to the trial-controller not receiving Trials to reconcile at a fast enough cadence. Do you agree with that assessment, and do you have any ideas on how to troubleshoot this?
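For context on why the reconcile loop itself could be the bottleneck: controller-runtime runs a single reconcile worker per controller unless MaxConcurrentReconciles is raised (the default is 1), and client-go's own default client-side rate limit is low (QPS 5, burst 10) unless it is raised on the rest.Config. The sketch below is a generic, recent controller-runtime setup showing where those knobs live; it is not Katib's actual wiring, and the watched type, worker count, and QPS/Burst values are only illustrative:

```go
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// jobReconciler stands in for a reconcile loop like Katib's ReconcileTrial;
// it only marks where the per-object work would happen.
type jobReconciler struct{}

func (r *jobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// In Katib's case this is where a Trial would be moved to Created and its Job launched.
	return ctrl.Result{}, nil
}

func main() {
	cfg := ctrl.GetConfigOrDie()

	// Raise the client-side rate limit; client-go's own defaults are QPS=5 / Burst=10,
	// which can stretch out thousands of create/update calls.
	cfg.QPS = 100
	cfg.Burst = 200

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{Scheme: clientgoscheme.Scheme})
	if err != nil {
		panic(err)
	}

	// controller-runtime processes reconcile requests with one worker per controller
	// unless MaxConcurrentReconciles is raised above its default of 1.
	err = ctrl.NewControllerManagedBy(mgr).
		For(&batchv1.Job{}). // placeholder watched type; Katib watches its Trial CRD
		WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
		Complete(&jobReconciler{})
	if err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

If the trial-controller is running with those defaults, either limit would be consistent with Trials trickling through the Created and Job-creation steps one at a time.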

Thank you for your help!

Environment:

  • Kubeflow version (kfctl version):
  • Minikube version (minikube version):
  • Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.12-gke.2", GitCommit:"fb7add51f767aae42655d39972210dc1c5dbd4b3", GitTreeState:"clean", BuildDate:"2020-06-01T22:20:10Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release):

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9 (2 by maintainers)

Top GitHub Comments

1 reaction
issue-label-bot[bot] commented, Sep 5, 2020

Issue-Label Bot is automatically applying the labels:

Label        Probability
area/katib   0.92

Please mark this comment with 👍 or 👎 to give our bot feedback! Links: app homepage, dashboard and code for this bot.


Top Results From Across the Web

  • Getting Started with Katib - Kubeflow
    This guide shows how to get started with Katib and run a few examples using the command line and the Katib user interface...
  • Training and Serving ML workloads ... - CERN Document Server
    The service should be able to scale individual workloads as well as serve a large amount of concurrent users. • Sustainability, meaning the...
  • Optuna vs Hyperopt: Which Hyperparameter Optimization ...
    In this section I want to see how to run a basic hyperparameter tuning script for both libraries, see how natural and easy-to-use...
  • From Notebook to HP Tuning to Kubeflow Pipelines with Kale
    As a next step, Kale will scale up the resulting pipeline to multiple parallel runs for hyperparameter tuning using Kubeflow Katib.
  • Training and Serving ML workloads with Kubeflow at CERN
    Machine Learning (ML) has been growing in popularity in multiple areas ... active analysis, large scale distributed model training and model ...
