Decouple "Dask" and Environments
Currently prefect cloud/server flow runs are configured by 3 concepts:
- `Environment`
- `Executor`
- `Storage`

while local flow runs are configured with an `Executor` alone.

Subclasses of `Environment` and `Executor` both exist with “dask” in their name, which can make things confusing, especially since the usage differs between local and cloud/server runs.
For example, if you’re trying to run a flow on kubernetes using Dask:
- If you’re running locally, you’d use `dask-kubernetes` to spin up a dask cluster, then configure a `DaskExecutor` to use that cluster, then pass that executor to `flow.run`.
- If you’re running using cloud, you’d configure a `DaskKubernetesEnvironment` and attach it to your flow. Environments also have executors, which makes this confusing, since you have a `Dask*` thing inside a `Dask*` thing.
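For concreteness, the local path might look roughly like the following. This is a minimal sketch, assuming the Prefect 0.x import paths and the classic `dask-kubernetes` API; `worker-spec.yml` is a hypothetical pod spec file:

```python
from dask_kubernetes import KubeCluster

from prefect import Flow, task
from prefect.engine.executors import DaskExecutor  # import path in Prefect 0.x

@task
def inc(x):
    return x + 1

with Flow("k8s-dask-example") as flow:
    inc(1)

# Spin up a dask cluster on kubernetes yourself...
cluster = KubeCluster.from_yaml("worker-spec.yml")  # hypothetical pod spec
cluster.scale(3)

# ...then point a DaskExecutor at its scheduler, and pass that executor
# to flow.run
executor = DaskExecutor(address=cluster.scheduler_address)
flow.run(executor=executor)

cluster.close()
```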
If you then decide you don’t want to use a dask executor but still want to run on cloud, you need to change your environment to a `KubernetesJobEnvironment`. But you can still use a dask executor inside the `KubernetesJobEnvironment`? I find the current mixing of concepts a bit confusing - it feels like environments are encroaching on executors’ space.
I think a potentially cleaner way of specifying flow execution would be to remove anything “dask” related from environments entirely, and move that responsibility to executors alone. The separation of duties would be:
- `Environment`: Configures where and how to run the process that starts up a `FlowRunner`. This may be a kubernetes job, a docker container, a local process, etc.
- `Executor`: Configures where and how to run tasks inside a single flow run. An executor could run things without dask, with dask locally, connect to a remote dask cluster, or start a new temporary dask cluster just for this flow run. This is basically how things work now (minus the cluster creation).
One nice thing about this is that the only difference between running a flow locally and running a flow on cloud is the addition of the `Environment`. If you want to use a dask cluster on kubernetes to run your flow, in both cases you configure an `Executor` to spin up a dask cluster using `dask-kubernetes`. The `Environment` is only responsible for managing the initial process, not the dask processes.
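To make that symmetry concrete, here is a sketch. The executor API is the one proposed in the plan below, and the way the environment would receive the executor (the commented-out lines) is an assumption, not settled API:

```python
from prefect import Flow, task
from prefect.engine.executors import DaskExecutor

@task
def say_hello():
    print("hello")

with Flow("my-flow") as flow:
    say_hello()

# One executor configuration, identical for local and cloud runs: it spins
# up a temporary dask cluster on kubernetes for the duration of the flow run.
executor = DaskExecutor(
    cluster_class="dask_kubernetes.KubeCluster",
    cluster_kwargs={"pod_template": "worker-spec.yml"},  # hypothetical spec
)

# Local: pass the executor directly to flow.run
flow.run(executor=executor)

# Cloud/server: the only change is attaching an Environment, which manages
# the process that starts the FlowRunner (hypothetical attachment API):
# flow.environment = KubernetesJobEnvironment(executor=executor)
# flow.register(project_name="my-project")
```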
If people like this plan, I think a way forward would be:
- Create an `Executor` class that manages a dask cluster lifetime (starting/stopping/scaling) using the standard dask cluster interface (similar to #2503, which I think is superseded by this issue). This might look like:

  ```python
  executor = DaskExecutor(
      cluster_class="dask_cloudprovider.FargateCluster",
      cluster_kwargs={
          "image": "my-fancy-prefect-image",
      },
      ...
  )
  ```

  The lifetime of the dask cluster created by this executor would be the lifetime of the flow run (see the sketch after this list). If we needed to have specific classes for e.g. `DaskKubernetesExecutor` we could, but if possible I’d prefer to keep this generic (as specified in #2503).

- Deprecate the existing `Dask*Environment` classes, with instructions for how to move to using the new executor. For `DaskKubernetesEnvironment`, for example, an equivalent configuration might be:

  ```python
  environment = KubernetesJobEnvironment(
      executor_class=DaskExecutor,
      executor_kwargs=dict(
          cluster_class="dask_kubernetes.KubeCluster",
          cluster_kwargs={...},
      ),
  )
  ```
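For illustration, here is a rough sketch of how such an executor could own the cluster’s lifetime using the standard dask cluster interface. The structure and names are assumptions for discussion, not a final implementation:

```python
from contextlib import contextmanager
from importlib import import_module

from distributed import Client


class DaskExecutor:
    """Sketch: an executor that creates and tears down its own dask cluster."""

    def __init__(self, cluster_class, cluster_kwargs=None):
        # The cluster class is given as a "module.ClassName" string so the
        # executor stays serializable and the library is imported lazily.
        self.cluster_class = cluster_class
        self.cluster_kwargs = cluster_kwargs or {}

    @contextmanager
    def start(self):
        # Resolve the cluster class only when the flow run actually starts.
        module_name, class_name = self.cluster_class.rsplit(".", 1)
        cluster_cls = getattr(import_module(module_name), class_name)

        # The cluster is created when the flow run starts and closed when it
        # finishes, so its lifetime matches the flow run's lifetime. Any dask
        # cluster class (KubeCluster, FargateCluster, LocalCluster, ...) fits,
        # since they all implement the same context-manager interface.
        with cluster_cls(**self.cluster_kwargs) as cluster:
            with Client(cluster) as client:
                self._client = client
                yield client
```

The flow runner would enter `start()` once per flow run and submit tasks through the resulting client.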
Thoughts?
@jcrist The decoupling of these makes sense. It would also help to keep parity between Core and Cloud execution options, e.g. right now Dask Cloud Provider is an option for an Environment with Cloud but not as an Executor with Core. I really like this idea. (I agree that this issue should supersede https://github.com/PrefectHQ/prefect/issues/2503. I’ll go comment there now, but for me it’s ok to close that one in favor of this.)
This is done in #2667.