Provide documentation for running on k8s with run-config based flow


Description

As described in https://github.com/PrefectHQ/prefect/issues/3509, we have problems enabling debug-level logging for our DaskKubernetesEnvironment.

We therefore tried flow.run_config instead, which worked.

However, tasks no longer seem to be parallelized properly.

We tried both flow.executor = DaskExecutor() and flow.executor = LocalDaskExecutor(num_workers=...), but got the following Gantt charts:

flow.run_config: (screenshot of Gantt chart)

flow.run_config + flow.executor = DaskExecutor(): (screenshot of Gantt chart)

flow.run_config + flow.executor = LocalDaskExecutor(num_workers=20): (screenshot of Gantt chart)

Although the tasks do parallelize with the LocalDaskExecutor, each takes about 20 times as long (36 s instead of 1–2 s).

Expected Behavior

  • screenshot of a parallelized run (albeit without debug-level logging)

DaskKubernetesEnvironment: (screenshot of Gantt chart with tasks running in parallel)

Reproduction

  • add parallelization to a minimal example (see the sketch below)
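
Something along these lines should reproduce it (a hypothetical sketch; the task body and numbers stand in for our real workload):

import time

from prefect import Flow, task
from prefect.engine.executors import LocalDaskExecutor  # prefect.executors in 0.14+
from prefect.run_configs import KubernetesRun

@task
def wait(n):
    # stand-in for a task that takes 1-2 s on its own
    time.sleep(n)
    return n

with Flow("parallel-minimal-example") as flow:
    # 20 mapped tasks that should run concurrently
    wait.map([1] * 20)

flow.run_config = KubernetesRun()
flow.executor = LocalDaskExecutor(num_workers=20)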

Environment

  • AWS EKS cluster
## KUBERNETES
import pulumi
import pulumi_kubernetes as k8s

# PREFECT_AGENT_AUTH, PREFECT_AGENT_LABELS, PREFECT_AGENT_IMAGE and
# k8s_provider are defined elsewhere in our Pulumi program.
prefect_agent_labels = {
	"app": "prefect-agent",
}

prefect_agent_deployment = k8s.apps.v1.Deployment(
	"prefect-agent",
	spec={
		"replicas": 1,
		"selector": {
			"match_labels": prefect_agent_labels
		},
		"template": {
			"metadata": {
				"labels": prefect_agent_labels
			},
			"spec": {
				"containers": [{
					"name": "agent",
					"resources": {
						"limits": {
							"cpu": "100m",
							"memory": "128Mi",
						}
					},
					"args": [
						"prefect agent start kubernetes",
					],
					"command":[
						"/bin/bash", "-c"
					],
					"env":[
						{"name": "PREFECT__CLOUD__AGENT__AUTH_TOKEN",
                        "value": PREFECT_AGENT_AUTH},
						{"name": "PREFECT__CLOUD__API",
						 "value": "https://api.prefect.io", },
						{"name": "NAMESPACE",
						 "value": "default", },
						{"name": "IMAGE_PULL_SECRETS",
						 "value": "", },
						{"name": "PREFECT__CLOUD__AGENT__LABELS",
						 "value": PREFECT_AGENT_LABELS, },
						{"name": "PREFECT__LOGGING__LEVEL",
						 "value": "DEBUG"},
						{"name": "PREFECT__CLOUD__LOGGING__LEVEL",
						 "value": "DEBUG"},
						{"name": "JOB_MEM_REQUEST",
						 "value": "", },
						{"name": "JOB_MEM_LIMIT",
						 "value": "", },
						{"name": "JOB_CPU_REQUEST",
						 "value": "", },
						{"name": "JOB_CPU_LIMIT",
						 "value": "", },
						{"name": "IMAGE_PULL_POLICY",
						 "value": "", },
						{"name": "SERVICE_ACCOUNT_NAME",
						 "value": "", },
						{"name": "PREFECT__BACKEND",
						 "value": "cloud", },
						{"name": "PREFECT__CLOUD__AGENT__AGENT_ADDRESS",
						 "value": "http://:8080", },
					],
					"image": PREFECT_AGENT_IMAGE,
					#"image": "prefecthq/prefect:0.12.3-python3.6",
					"image_pull_policy": "Always",
					"liveness_probe": {
						"failure_threshold": 2,
						"http_get": {
							"path": "/api/health",
							"port": 8080
						},
						"initial_delay_seconds": 40,
						"period_seconds": 40,
					},
				}]
			}
		}
	},
	opts=pulumi.ResourceOptions(provider=k8s_provider)
)


Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7

Top GitHub Comments

1 reaction
jcrist commented, Oct 16, 2020

Definitely! I’ll ping you once we have some docs up. I’m going to change this issue into a docs issue for running on k8s with run configs, since that’ll mostly be taking the above info and converting it into docs 😃.

1 reaction
jcrist commented, Oct 15, 2020

Hi @Zaubeerer, sorry to hear you’re having issues. The run_config stuff is still experimental (and undocumented), so thanks for trying things out. I believe all the issues you’re having have to do with your prefect/k8s configuration, and aren’t evidence of a bug in prefect/dask/something else. Responding inline:

We tried both flow.executor = DaskExecutor() and flow.executor = LocalDaskExecutor(num_workers=...), but got the following gantt charts:

From the community slack, it looks like you’re used to using the DaskKubernetesEnvironment, which runs your flow spread out across a number of k8s pods. Switching to KubernetesRun will have your flow run in a single k8s pod (without additional executor configuration) - depending on your flow this may be fine or may lead to perf issues. By default the flow will run serially, one task at a time, but (as you’ve rightly noted) you can set an executor on the flow to re-enable parallelism.
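
The combination looks roughly like this (a minimal sketch; the flow name and image are placeholders, and the import paths are for the 0.13.x series):

from prefect import Flow
from prefect.engine.executors import DaskExecutor  # prefect.executors in 0.14+
from prefect.run_configs import KubernetesRun

with Flow("my-flow") as flow:
    ...  # your tasks here

# run the whole flow in a single k8s job pod...
flow.run_config = KubernetesRun(image="my-registry/my-flow:latest")

# ...but execute its tasks in parallel inside that pod
flow.executor = DaskExecutor()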

Both options you’ve tried will run within the same pod using either threads or processes - but for this to be effective, the pod needs sufficient resources to run in parallel. For IO heavy tasks, using more threads than you have cores allocated to your pod is fine, but for compute heavy tasks you don’t want more processes/threads than you have cores (I’m not sure what your flow is doing, but it sounds like it’s compute heavy).
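
With LocalDaskExecutor you can choose between the two explicitly (a sketch; the worker counts are illustrative):

from prefect.engine.executors import LocalDaskExecutor  # prefect.executors in 0.14+

# IO-heavy tasks: threads, possibly more workers than allocated cores
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=8)

# compute-heavy tasks: processes, at most one worker per allocated core
flow.executor = LocalDaskExecutor(scheduler="processes", num_workers=2)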

In the latest release (0.13.11) there are no default resource limits for a job; your job will be scheduled wherever the k8s cluster sees fit. In previous releases, the cpu request/limit defaulted to less than 0.1 CPU, which isn’t enough for any compute-heavy task. You can always set your own request/limit (as you’ve found) by passing cpu_request/cpu_limit to KubernetesRun.
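
For example (a sketch; the image and resource values are illustrative):

from prefect.run_configs import KubernetesRun

# request enough cpu for the executor's workers to actually run in parallel
flow.run_config = KubernetesRun(
    image="my-registry/my-flow:latest",
    cpu_request="2",
    cpu_limit="4",
)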

With the following code, the flow is not even started. Is there a contradiction of what the flow asks for (5 CPUs) and what the AWS EKS is configured for?

Yeah, if you request a pod that requires more resources than any node in the cluster can give, it will hang until there’s a place for it (possibly forever). This depends on the node configuration of your cluster, and on whether you have any auto-scaling node pools. Running everything in a single pod is simpler, and is what I’d recommend when possible, but once your compute gets large enough it may be less expensive to scale out to multiple pods than to require that a larger cluster node be available.

I just tried to add n_workers as kwarg to DaskExecutor, but it does not seem to be successful, as the flow does not start…

From that configuration, I’d guess you’re swamping the node, as it doesn’t have enough cpu to manage that many workers. The flow would eventually run, but very slowly, because everything is highly oversubscribed.


I believe the above explains the issues you’ve been seeing, please let me know if you have any questions.

I plan to write a bunch of example docs for deployments with run configs, and there will certainly be a large section on k8s (both with and without dask), but for now there’s unfortunately not anything I can point you to. To use dask-kubernetes without the DaskKubernetesEnvironment, you’d want to configure your DaskExecutor with a cluster_class/cluster_kwargs/adapt_kwargs to start and scale a dask_kubernetes.KubeCluster appropriately.

  • cluster_class: either the import name of a callable, or a callable to create a dask cluster. Ideally "dask_kubernetes.KubeCluster" would be all you’d need, but not all the options you might want to change (like the worker image) are exposed in the constructor without writing k8s specs. This is unfortunate; I’m hoping to upstream some patches to make this easier to do.
  • cluster_kwargs: kwargs to pass to cluster_class at runtime to create a cluster.
  • adapt_kwargs: kwargs to pass to cluster.adapt at runtime to enable adaptive scaling, if desired.

Ideally I’d like the following to work:

from prefect.engine.executors import DaskExecutor  # prefect.executors in 0.14+

flow.executor = DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={"image": "your-worker-image-here", "n_workers": 4},
)

But image isn’t exposed in the KubeCluster constructor, so you’d need to do something more like:

import dask_kubernetes

from prefect.engine.executors import DaskExecutor  # prefect.executors in 0.14+

def make_cluster():
    """A function that makes a dask-kubernetes cluster with the configuration you want"""
    # build the worker pod spec (this is where the worker image is set)
    spec = dask_kubernetes.make_pod_spec(
        image="your-worker-image",
        # ... other pod spec options ...
    )
    # fixed-size cluster of 4 workers
    cluster = dask_kubernetes.KubeCluster(spec, n_workers=4)
    return cluster

flow.executor = DaskExecutor(cluster_class=make_cluster)

# if you wanted adaptive scaling instead of fixed scaling, you'd drop the
# `n_workers` in `make_cluster` and do something like:
# 
# flow.executor = DaskExecutor(cluster_class=make_cluster, adapt_kwargs={"maximum": 10})

(you might find the dask-kubernetes docs useful here).

Apologies that there aren’t better docs on this - as I said before, you’re using some very new features. If you do try out the above, please let us know how things work for you.
