Unable to scale to a large number of concurrent Katib Experiments / Trials
/kind bug
What steps did you take and what happened:
We launched 120 Katib Experiments simultaneously, each configured with Parallel Trial Count: 30. While monitoring the cluster via kubectl -n kubeflow get pods | grep -v Completed | wc -l, we saw no more than 160 pods running concurrently.
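The monitoring pipeline above counts every line that does not contain "Completed", which by default includes the kubectl header row as well as pods in states such as Pending. A minimal sketch of a per-status tally, assuming the default `kubectl get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE), is shown below. The pod names in the sample are hypothetical and the helper `count_pods_by_status` is illustrative, not part of Katib or kubectl.

```python
from collections import Counter


def count_pods_by_status(kubectl_output: str) -> Counter:
    """Tally pods by the STATUS column of plain-text `kubectl get pods` output."""
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    if not lines:
        return Counter()
    # Locate the STATUS column from the header row instead of hard-coding it.
    status_idx = lines[0].split().index("STATUS")
    counts = Counter()
    for ln in lines[1:]:  # skip the header row
        fields = ln.split()
        if len(fields) > status_idx:
            counts[fields[status_idx]] += 1
    return counts


# Hypothetical sample output; real pod names will differ.
sample = """\
NAME                 READY   STATUS      RESTARTS   AGE
example-trial-a      1/1     Running     0          12s
example-trial-b      1/1     Running     0          9s
example-trial-c      0/1     Pending     0          3s
example-trial-d      0/1     Completed   0          45s
"""

counts = count_pods_by_status(sample)
print(dict(counts))
```

Piping the real command output into a helper like this separates Running pods from Pending ones, which helps tell a scheduling bottleneck apart from a pod-creation bottleneck.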
What did you expect to happen:
We expected 120 * 30 = 3600 pods to be running concurrently, or as close to that as our cluster resources would allow.
Anything else you would like to add:
- This was run using the v1alpha3 version of Katib.
- Each individual trial is very short-running (approximately 30 seconds).
- The pod running the katib-controller did not exceed 25% CPU utilization, so it wasn’t obviously underwater trying to reconcile all of the Trials.
- It took a long time for the Trial objects to reach Created state.
- It took a long time for Created trials to launch a Job object.
- Once the Job object was created, the pod was created shortly thereafter and the task ran successfully.
- All Katib Trials / Experiments did eventually complete successfully.
Since both of those slow steps are handled by ReconcileTrial.Reconcile, it makes me think that the delay is somehow related to the trial-controller not receiving Trials to reconcile on a fast enough cadence. Do you agree with that assessment, and do you have any ideas on how to troubleshoot?
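As a rough back-of-the-envelope check on that hypothesis, Little's law (L = λW) relates the number of concurrent pods to the rate at which trials are launched. The sketch below assumes a steady state, that each pod lives for about 30 seconds, and that all 160 observed pods were trial pods (an upper bound, since the count also covers other pods in the kubeflow namespace). The function name is illustrative.

```python
def steady_state_rate(concurrent: float, duration_s: float) -> float:
    """Little's law rearranged: arrival rate = concurrency / time in system."""
    return concurrent / duration_s


observed_pods = 160          # peak concurrency reported in this issue
trial_duration_s = 30.0      # approximate trial runtime reported in this issue
target_pods = 120 * 30       # 120 Experiments x 30 parallel Trials

# Launch rate implied by the observed concurrency (assumption: steady state).
observed_rate = steady_state_rate(observed_pods, trial_duration_s)
# Launch rate needed to keep the full target concurrency busy.
required_rate = steady_state_rate(target_pods, trial_duration_s)

print(f"observed launch rate ~ {observed_rate:.1f} trials/s")
print(f"required launch rate ~ {required_rate:.0f} trials/s")
print(f"shortfall factor     ~ {required_rate / observed_rate:.1f}x")
```

Under these assumptions, the controller was launching on the order of 5 trials per second, while sustaining 3600 concurrent 30-second trials would need roughly 120 per second. That gap would be consistent with a throughput limit in the reconcile path rather than a CPU limit, though it does not identify the cause.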
Thank you for your help!
Environment:
- Kubeflow version (kfctl version):
- Minikube version (minikube version):
- Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.12-gke.2", GitCommit:"fb7add51f767aae42655d39972210dc1c5dbd4b3", GitTreeState:"clean", BuildDate:"2020-06-01T22:20:10Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release):
Issue Analytics
- State:
- Created 3 years ago
- Comments: 9 (2 by maintainers)
Top GitHub Comments
Issue-Label Bot is automatically applying the labels:
Please mark this comment with 👍 or 👎 to give our bot feedback! Links: app homepage, dashboard and code for this bot.