Trial metrics-logger-and-collector breaks GKE Node-Pool Autoscaling
See original GitHub issue
/kind bug
Hello! TL;DR: the ephemeral-storage request in the metrics-logger container that is part of the Trial pod prevents GKE node-pool autoscaling from scaling up when the current scale is 0.
What steps did you take and what happened: In order to get the best possible results from our HPO training while keeping costs low, we want to run Trials on auto-scaling GPU node pools in GKE. By default these node pools should be scaled to 0 nodes when we aren't running any jobs, and as users submit Experiments the GPU node pools should scale up as Trials are created. We quickly noticed that Trial pods were sitting in an unschedulable state and the node pools were not scaling up as they should. Upon further investigation we found this error on the pending pods:
Type     Reason             Age                 From                Message
----     ------             ----                ----                -------
Normal   NotTriggerScaleUp  16m (x33 over 36m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 6 Insufficient nvidia.com/gpu, 26 Insufficient ephemeral-storage
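(These events come from the pending Trial pod and can be inspected with kubectl describe pod <trial-pod-name> -n <namespace>; the pod name and namespace here are placeholders.)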
Coupling this with the following information from the GKE docs (https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler): Cluster autoscaler has the following limitations:
- Scaling up a node group of size 0, for Pods requesting resources beyond CPU, memory and GPU (ex. ephemeral-storage).
That should be fine, right? In the TrialTemplate we define only the following resource requests: cpu: 3000m, memory: 10Gi, nvidia.com/gpu: 1.
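For context, here is a minimal sketch of how those requests sit in the training container spec of our TrialTemplate; the container name and image are placeholders, and the GPU limit is shown because Kubernetes requires extended resources such as nvidia.com/gpu to be set in limits:

  containers:
    - name: training-container        # placeholder name
      image: <our-training-image>     # placeholder image
      resources:
        requests:
          cpu: 3000m
          memory: 10Gi
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1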
So where does the ephemeral-storage request come from? The offender is the metrics-logger-and-collector sidecar container that Katib injects into the Trial pod:
...
containers:
  <training-container>
    ...
  metrics-logger-and-collector:
    Image:      gcr.io/kubeflow-images-public/katib/v1beta1/file-metrics-collector
    Port:       <none>
    Host Port:  <none>
    Args:
      -t
      hyperparam-tuning-<id>-train-<id>
      -m
      user_metric
      -s
      katib-db-manager.kubeflow:6789
      -path
      /var/log/katib/metrics.log
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:  <none>
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-<id> (ro)
So that’s the problem we’re dealing with. Technically we could keep our node pools scaled to 1 at all times to work around this, but this is impractical from a cost perspective.
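For completeness, pinning the pool's minimum size to 1 would look roughly like this with gcloud (cluster, pool, and zone names are placeholders), which is exactly the standing cost we want to avoid:

  gcloud container clusters update <cluster-name> \
    --node-pool <gpu-pool-name> \
    --enable-autoscaling --min-nodes 1 --max-nodes 5 \
    --zone <zone>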
Is this ephemeral-storage request/limit really necessary? Ephemeral-storage is still a beta feature in both Kubernetes and GKE. Is there some way we can work around this, or do we need to roll our own katib-controller or node-pool autoscaler? Would you consider removing this request from the metrics collector until the feature moves out of beta? Honestly, any suggestions would be helpful.
What did you expect to happen:
- auto-magical auto-scaling
Environment:
- GKE
- Katib v1beta1
Top GitHub Comments
Thank you for the issue @kylepad.
Yes, you can easily control Limits and Requests for your metrics collector container using the katib-config ConfigMap. Check here: https://master.kubeflow.org/docs/components/hyperparameter-tuning/katib-config/#metrics-collector-sidecar-settings. You just need to modify the katib-config ConfigMap with the required values for your metrics collector (StdOut, File, etc.) and submit the Experiment; you don't need to restart the controller. Here is an example of non-default limits for the TensorFlowEvent metrics collector: https://github.com/kubeflow/katib/blob/9cf45448e40cff0d558e2266a659091bd06e8e44/manifests/v1beta1/katib-controller/katib-config.yaml#L15-L23
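For anyone landing here, a rough sketch of what such an override could look like in the katib-config ConfigMap, assuming the metrics-collector-sidecar layout from the linked manifest; the image and resource values below are illustrative and should be checked against your Katib version:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: katib-config
    namespace: kubeflow
  data:
    # Per-collector-kind settings; only the File collector is overridden here.
    metrics-collector-sidecar: |-
      {
        "File": {
          "image": "gcr.io/kubeflow-images-public/katib/v1beta1/file-metrics-collector",
          "resources": {
            "requests": {
              "cpu": "50m",
              "memory": "10Mi"
            },
            "limits": {
              "cpu": "500m",
              "memory": "100Mi"
            }
          }
        }
      }

Whether omitting ephemeral-storage here fully removes the default request depends on how the controller merges its defaults with the ConfigMap values, so this is worth verifying on your cluster before relying on it for scale-from-zero.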