
[BUG] - Unable to force Dask workers to run on AWS EKS specific nodegroups


OS and architecture on which you are running QHub

Amazon Linux 2 on AWS

Expected behavior

ℹ️ Be able to place Dask worker and/or scheduler pods on specific AWS EKS node groups, based on the information provided in the qhub-config.yaml file (the dask_worker and node_groups profiles)

Actual behavior

Whenever a QHub deployment is done on AWS and a custom dask_worker entry/profile in the qhub-config.yaml file refers to one of the node_groups entries also defined in the config file, the Kubernetes cluster fails to place the Dask worker and scheduler pods on the appropriate EKS node group(s).

  • The qhub init command with the appropriate arguments is executed and generates the qhub-config.yaml file;
  • Some modifications are made to the config file; in this case, EKS node_groups profiles are added;
  • Additionally, dask_worker profiles are added to the config file, each with a nodeSelector key/value pair that references the desired Kubernetes node group on which the Dask scheduler and/or workers should run, as described in the documentation -> Setting specific dask workers to run on a nodegroup;
  • The qhub deploy -c qhub-config.yaml --disable-render command (I render the files beforehand because some VPC settings need changing in my setup) is executed successfully and the deployment goes through as expected;
  • When shared/examples/dask-gateway.ipynb is used to test Dask, by setting the options to Environment = filesystem/dask and Cluster Profile = GPU Worker / 4xCPU Cores / 30GB MEM / 1x GPU (the custom gpu profile created in the qhub-config.yaml), Dask Gateway tries to place the Dask worker pod on the worker node group instead of gpu. I have tried different configurations, including specifying the same node group name in both scheduler_extra_pod_config and worker_extra_pod_config (see the sketch just below).
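
For reference, the notebook widget boils down to roughly the following Dask Gateway calls (a minimal sketch; the conda_environment and profile option names are taken from the gateway config shown further below, everything else is standard dask-gateway client API):

from dask_gateway import Gateway

# In a QHub user environment the gateway address and auth are preconfigured
gateway = Gateway()

options = gateway.cluster_options()
options.conda_environment = "filesystem/dask"                      # "Environment" dropdown
options.profile = "GPU Worker / 4xCPU Cores / 30GB MEM / 1x GPU"   # "Cluster Profile" dropdown

cluster = gateway.new_cluster(options)
cluster.scale(2)
client = cluster.get_client()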

The qhub-config.yaml file looks something like this:

project_name: my-qhub-deploy
provider: aws
...
terraform_state:
  type: remote
namespace: dev
qhub_version: 0.4.3
amazon_web_services:
  region: us-east-1
  kubernetes_version: '1.22'
  node_groups:
    general:
      instance: m6i.xlarge
      min_nodes: 1
      max_nodes: 1
    user:
      instance: m6i.xlarge
      min_nodes: 1
      max_nodes: 5
    worker:
      instance: m6i.2xlarge
      min_nodes: 2
      max_nodes: 10
    gpu:
      instance: g4dn.2xlarge
      min_nodes: 1
      max_nodes: 3
      gpu: true
...
  dask_worker:
    "GPU Worker / 4xCPU Cores / 30GB MEM / 1x GPU":
      worker_cores_limit: 4
      worker_cores: 4
      worker_memory_limit: 30G
      worker_memory: 30G
      worker_threads: 6
      scheduler_extra_pod_config:
        nodeSelector:
          "eks.amazonaws.com/nodegroup": worker
      worker_extra_pod_config:
        nodeSelector:
          "eks.amazonaws.com/nodegroup": gpu
    "Large Worker / 4xCPU Cores / 30GB MEM / no GPU":
      worker_cores_limit: 4
      worker_cores: 4
      worker_memory_limit: 30G
      worker_memory: 30G
      worker_threads: 8
      scheduler_extra_pod_config:
        nodeSelector:
          "eks.amazonaws.com/nodegroup": worker
      worker_extra_pod_config:
        nodeSelector:
          "eks.amazonaws.com/nodegroup": worker
...
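
As a sanity check (a sketch, assuming the kubernetes Python client and a kubeconfig pointing at the cluster), the nodeSelector entries above can only match if the EKS nodes actually carry the eks.amazonaws.com/nodegroup label with the expected values:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Print each node together with the EKS node group label the nodeSelector keys off of
for node in v1.list_node().items:
    print(node.metadata.name, node.metadata.labels.get("eks.amazonaws.com/nodegroup"))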

While troubleshooting, I had a look at the K8s qhub-daskgateway-gateway secret, which holds the config.json key (a Base64-encoded JSON payload). It seems that the worker-node-group key is always the same (containing only the entry for “worker”):

"worker-node-group": {
    "key": "eks.amazonaws.com/nodegroup",
    "value": "worker"
}

This seems a bit strange, because the additional node group was specified under node_groups in qhub-config.yaml and the appropriate nodeSelector was added to the dask_worker entries.
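
To see exactly what the gateway is given, the secret can be decoded like this (a sketch, assuming kubectl access to the dev namespace; the secret and key names are the ones mentioned above):

import base64
import json
import subprocess

# Pull the Base64-encoded config.json key out of the qhub-daskgateway-gateway secret
raw = subprocess.check_output(
    [
        "kubectl", "-n", "dev", "get", "secret", "qhub-daskgateway-gateway",
        "-o", r"jsonpath={.data.config\.json}",
    ],
    text=True,
)
cfg = json.loads(base64.b64decode(raw))
print(json.dumps(cfg["worker-node-group"], indent=2))
print(list(cfg.get("profiles", {})))  # profile names the gateway knows about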

While troubleshooting, I also put together a flowchart to help me walk through the process (attached as troubleshooting_issue).

I’m not sure if something went wrong on my end, but I ran a couple of clean deployments by simply generating a qhub-config.yaml file with qhub init and then adding the node_groups entries as well as the dask_worker profiles.

My apologies for such a long description. Kudos to everyone here for the great piece of software that QHub is 🚀

How to Reproduce the problem?

  • Created a new Python 3.9 environment and installed qhub
  • Initialized the setup for AWS, e.g. qhub init aws --project my-project --domain qhub.mydomain.com --ssl-cert-email joao@limacarvalho.com
  • Added a new entry under node_groups and a new entry under dask_worker to the auto-generated qhub-config.yaml file:
amazon_web_services:
  region: us-east-1
  kubernetes_version: '1.22'
  node_groups:
    general:
      instance: m6i.xlarge
      min_nodes: 1
      max_nodes: 1
    user:
      instance: m6i.xlarge
      min_nodes: 1
      max_nodes: 5
    worker:
      instance: m6i.2xlarge
      min_nodes: 1
      max_nodes: 10
    gpu:
      instance: g4dn.2xlarge
      min_nodes: 1
      max_nodes: 3
      gpu: true
...
  dask_worker:
    "GPU Worker":
      worker_cores_limit: 4
      worker_cores: 4
      worker_memory_limit: 30G
      worker_memory: 30G
      worker_threads: 6
      scheduler_extra_pod_config:
        nodeSelector:
          "eks.amazonaws.com/nodegroup": gpu
      worker_extra_pod_config:
        nodeSelector:
          "eks.amazonaws.com/nodegroup": gpu
...
  • Executed qhub deploy -c qhub-config.yaml
  • The system deployed successfully, but Dask fails to place the worker pod on the appropriate EKS node group, e.g. gpu (see the check below)
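
A quick way to confirm the mis-placement (a sketch, assuming the kubernetes Python client; the dask-worker/dask-scheduler name prefixes are what the gateway-created pods carry in my deployment and may differ in yours):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Show which node (and therefore which node group) each Dask pod actually landed on
for pod in v1.list_namespaced_pod("dev").items:
    if pod.metadata.name.startswith(("dask-worker", "dask-scheduler")):
        print(pod.metadata.name, "->", pod.spec.node_name, pod.spec.node_selector)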

Command output

-

Versions and dependencies used.

  • qhub_version: 0.4.3
  • amazon_web_services (region): us-east-1
  • kubernetes_version: ‘1.22’

Compute environment

AWS

Integrations

Dask

Anything else?

✅ I was able to work around this issue by adjusting the dask_gateway_config.py file that is part of the K8s configmap/qhub-daskgateway-gateway.

Before:

def base_node_group():
    # Always uses the single, global "worker-node-group" entry from config.json,
    # regardless of which cluster profile was selected
    worker_node_group = {
        config["worker-node-group"]["key"]: config["worker-node-group"]["value"]
    }

    return {
        "scheduler_extra_pod_config": {"nodeSelector": worker_node_group},
        "worker_extra_pod_config": {"nodeSelector": worker_node_group},
    }

After:

def base_node_group(options):
    # Take the nodeSelector defined per profile in qhub-config.yaml instead of
    # the single global "worker-node-group" entry
    worker_node_group = config["profiles"][options.profile]["worker_extra_pod_config"]["nodeSelector"]
    scheduler_node_group = config["profiles"][options.profile]["scheduler_extra_pod_config"]["nodeSelector"]

    return {
        "scheduler_extra_pod_config": {"nodeSelector": scheduler_node_group},
        "worker_extra_pod_config": {"nodeSelector": worker_node_group},
    }

#...
# Adding the "options" object as an argument to the base_node_group() function call
def worker_profile(options, user):
    namespace, name = options.conda_environment.split("/")
    return functools.reduce(
        deep_merge,
        [
            base_node_group(options),
            base_conda_store_mounts(namespace, name),
            base_username_mount(user.name),
            config["profiles"][options.profile],
            {"environment": {**options.environment_vars}},
        ],
        {},
    )

With this change, the worker pod is placed on the correct node group specified in the nodeSelector key in qhub-config.yaml. Additionally, I can also specify which node group the scheduler pod runs on, which increases the flexibility of the setup (was this the idea when this code was initially pushed?).
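
One caveat with the snippet above: it assumes every profile defines both scheduler_extra_pod_config and worker_extra_pod_config. A slightly more defensive variant (my own sketch, not what QHub ships) falls back to the original global worker-node-group entry whenever a profile doesn't specify a nodeSelector:

def base_node_group(options):
    # Global default, as used by the original implementation
    default_node_group = {
        config["worker-node-group"]["key"]: config["worker-node-group"]["value"]
    }
    profile = config["profiles"][options.profile]

    # Per-profile nodeSelector if present, otherwise fall back to the global default
    worker_node_group = (
        profile.get("worker_extra_pod_config", {}).get("nodeSelector", default_node_group)
    )
    scheduler_node_group = (
        profile.get("scheduler_extra_pod_config", {}).get("nodeSelector", default_node_group)
    )

    return {
        "scheduler_extra_pod_config": {"nodeSelector": scheduler_node_group},
        "worker_extra_pod_config": {"nodeSelector": worker_node_group},
    }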

If any info or help is needed, feel free to reach out! Many thanks!

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

limacarvalho commented, Oct 11, 2022 (1 reaction)

Hi @iameskild,

Apologies for the delay in pushing the code. I was trying to deploy a clean v0.4.4 in order to run an end-to-end test, but somehow I got stuck and was unable to deploy due to some errors (Attempt 4 failed connecting to keycloak master realm…InsecureRequestWarning: Unverified HTTPS request is being made to host ‘dev.limacarvalho.com’. Adding certificate verification is strongly advised). I still need to figure out whether I’m doing something wrong, whether my environment is messed up, or whether something related to SSL certs has changed for 0.4.4.

I will keep you updated here about the test result 🙏

iameskild commented, Oct 7, 2022 (1 reaction)

That’s wonderful! Thank you @limacarvalho 😃
