
Trial metrics-logger-and-collector breaks GKE Node-Pool Autoscaling


/kind bug

Hello! TL;DR: the ephemeral-storage request in the metrics-logger-and-collector container that is part of the Trial pod prevents GKE node-pool autoscaling from scaling up when the current node count is 0.

What steps did you take and what happened:

To get the best possible results from our HPO training while keeping costs low, we want to run Trials on auto-scaling GPU node-pools in GKE. By default these node-pools should be scaled to 0 nodes when we aren’t running any jobs, and then, as users submit Experiments, the GPU node-pools should scale up as Trials are created. We quickly noticed that Trial pods were stuck in an unschedulable state and the node-pools were not scaling up as they should. Upon further investigation we found this event on the pending Trial pods:

  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Normal   NotTriggerScaleUp  16m (x33 over 36m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 6 Insufficient nvidia.com/gpu, 26 Insufficient ephemeral-storage

Couple this with the following limitation listed in the GKE cluster-autoscaler docs (https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler):

  • Scaling up a node group of size 0, for Pods requesting resources beyond CPU, memory and GPU (ex. ephemeral-storage).

That should be fine, right? In the TrialTemplate we only define the following resource requests on the training container: cpu: 3000m, memory: 10Gi, nvidia.com/gpu: 1.
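
Spelled out, that resources block on the training container looks roughly like this (a sketch reconstructed from the values above; the GPU limit is included because extended resources such as nvidia.com/gpu cannot be overcommitted, so the limit must match the request):

# Training-container resources as described above (sketch, not a full TrialTemplate)
resources:
  requests:
    cpu: 3000m
    memory: 10Gi
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1   # extended resources cannot be overcommitted

Note that there is no ephemeral-storage entry here, so these requests on their own should not block scale-up from zero.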

So where does the ephemeral-storage request come from? The offender is the metrics-logger-and-collector sidecar container that Katib injects into the Trial pod:

...
containers:
  <training-container>
    ...
  metrics-logger-and-collector:
    Image:      gcr.io/kubeflow-images-public/katib/v1beta1/file-metrics-collector
    Port:       <none>
    Host Port:  <none>
    Args:
      -t
      hyperparam-tuning-<id>-train-<id>
      -m
      user_metric
      -s
      katib-db-manager.kubeflow:6789
      -path
      /var/log/katib/metrics.log
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:          <none>
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-<id> (ro)

So that’s the problem we’re dealing with. Technically we could work around this by keeping our node-pools scaled to a minimum of 1 node at all times, but that is impractical from a cost perspective.

Is this ephemeral-storage request/limit really necessary? Ephemeral-storage is still a beta feature in both Kubernetes and GKE. Is there some way we can work around this, or do we need to roll our own katib-controller or node-pool autoscaler? Would you consider removing this request from the metrics collector until the feature moves out of beta? Honestly, any suggestions would be helpful.

What did you expect to happen:

  • auto-magical auto-scaling

Environment:

  • GKE
  • Katib v1beta1


Top GitHub Comments

andreyvelich commented on Aug 4, 2020 (1 reaction)

Thank you for the issue @kylepad.

Yes, you can easily control the Limits and Requests for your metrics collector container using the katib-config. See https://master.kubeflow.org/docs/components/hyperparameter-tuning/katib-config/#metrics-collector-sidecar-settings. You just need to modify the katib-config ConfigMap with the required values for your metrics collector (StdOut, File, etc.) and submit the Experiment. You don’t need to restart the controller. Here is an example of non-default limits for the TensorFlowEvent metrics collector: https://github.com/kubeflow/katib/blob/9cf45448e40cff0d558e2266a659091bd06e8e44/manifests/v1beta1/katib-controller/katib-config.yaml#L15-L23
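
For illustration, here is a trimmed sketch of what such an override could look like for the File metrics collector used in this issue, assuming Katib is installed in the kubeflow namespace (as the katib-db-manager.kubeflow address above suggests). The exact keys, and whether a default resource such as ephemeral-storage can be dropped entirely rather than just resized, depend on the Katib version, so check the linked katib-config reference for your release:

# Illustrative katib-config fragment: per-collector sidecar resources
# under the metrics-collector-sidecar settings (values are examples only).
apiVersion: v1
kind: ConfigMap
metadata:
  name: katib-config
  namespace: kubeflow
data:
  metrics-collector-sidecar: |-
    {
      "File": {
        "image": "gcr.io/kubeflow-images-public/katib/v1beta1/file-metrics-collector",
        "resources": {
          "requests": {
            "cpu": "50m",
            "memory": "10Mi"
          },
          "limits": {
            "cpu": "500m",
            "memory": "100Mi"
          }
        }
      }
    }

Newly created Trials pick up the sidecar settings from this ConfigMap, so updating it and re-submitting the Experiment is enough; as noted above, no controller restart is needed.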

issue-label-bot[bot] commented on Aug 4, 2020 (1 reaction)

Issue-Label Bot is automatically applying the labels:

Label Probability
area/katib 0.95

Please mark this comment with 👍 or 👎 to give our bot feedback! Links: app homepage, dashboard and code for this bot.


