Support cancellation of active flow runs
Currently flow runs can be transitioned into a `Cancelled` state, but that only prevents new tasks in the flow run from starting - it does nothing to stop already running tasks (it’s also not exposed nicely in the UI).
In practice this is hard to do effectively in all situations, since there is no straightforward way to interrupt an arbitrary Python function in all contexts. If running in process, we can interrupt the main thread, but not other threads. If running in a subprocess, we can kill the process. If running in a dask cluster, we can cancel pending work, but not stop running work. However, with #2667 the `DaskExecutor` can now manage the lifetime of a dask cluster, so we could have the executor shut down the cluster as a hard stop.
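To make the subprocess case concrete, here is a minimal sketch using only the standard library's `subprocess` module. The helper name `run_with_cancel` is hypothetical and not part of Prefect; the sketch only illustrates that work running in a child process can be hard-stopped from outside, which is not possible for an arbitrary thread.

```python
import subprocess
import sys


def run_with_cancel(code, cancel_after):
    """Run Python source `code` in a child process (hypothetical helper).

    Waits up to `cancel_after` seconds. Returns True if the child had to be
    terminated, False if it exited by itself first.
    """
    proc = subprocess.Popen([sys.executable, "-c", code])
    try:
        proc.wait(timeout=cancel_after)
        return False
    except subprocess.TimeoutExpired:
        # Hard stop: SIGTERM on POSIX, TerminateProcess on Windows.
        proc.terminate()
        proc.wait()  # reap the child so it does not linger as a zombie
        return True
```

Calling `run_with_cancel("import time; time.sleep(60)", 0.5)` terminates the sleeping child after half a second. The task body gets no chance to clean up, which is why this counts as a hard stop rather than a graceful one.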
Given the above restrictions, it seems we could add more support for cancelling flow runs with the following semantics:
- Cancellation of a flow run starts when the flow run is transitioned into a `Cancelling` state, where `Cancelling` is a subclass of `Running`. This indicates that the flow run is in the process of cancelling, but is still technically `Running`. Transitioning directly into `Cancelled` from `Running` skips a necessary intermediate state and should be deprecated.
- Cancellation is still “best effort”:
  - If a task is started, we’ll cancel it if possible, otherwise it may run until termination
  - If a task hasn’t started, it won’t be started
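The proposed state relationships can be sketched with plain classes. These are simplified stand-ins written for illustration, not Prefect's actual state classes; only the subclass relationships (`Cancelling` under `Running`, `Cancelled` under `Finished`) come from the proposal, and the helper functions are assumptions.

```python
class State:
    """Stand-in base class (assumption, not Prefect's real State)."""


class Running(State):
    pass


class Cancelling(Running):
    """Cancellation requested; the run still counts as Running."""


class Finished(State):
    pass


class Cancelled(Finished):
    """Cancellation complete; the run counts as Finished."""


def is_running(state):
    # Hypothetical helper: True for Running and any subclass of it.
    return isinstance(state, Running)


def is_finished(state):
    # Hypothetical helper: True for Finished and any subclass of it.
    return isinstance(state, Finished)
```

With this hierarchy, existing checks for "is the run still running?" keep returning true while a run is in `Cancelling`, and only flip once it reaches `Cancelled`.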
Focussing on how this works with orchestration, the following takes place after a user cancels a running flow:
- The flow run is moved to `Cancelling`
- Once in `Cancelling`, `TaskRunner` interactions with cloud/server (either heartbeats or state updates) will let the task runner know that the task should be cancelled. If possible, a running task will be interrupted, otherwise it will be allowed to finish. Tasks that are allowed to finish will transition to their final state as if cancellation hadn’t occurred (e.g. `Success` or `Failed`). This provides a more accurate representation of the cancelled flow run. Tasks that are successfully cancelled are marked as `Cancelled`.
- Likewise, `FlowRunner` interactions with cloud (either heartbeats or state updates) will let the flow runner know that the flow should be cancelled. Any tasks that haven’t been submitted yet will transition directly to `Cancelled`.
  - When the `FlowRunner` notices that a flow run has been cancelled, it will call a new method on the `Executor` to ask the executor to stop any ongoing work if possible. Tentatively I’ll call this method `executor.cancel`. For the `DaskExecutor`, if the executor is managing a temporary cluster (newly enabled in #2667) the cluster will be shut down, effectively cancelling any ongoing tasks. Otherwise a warning will be output (that pending tasks can’t be stopped) and any ongoing work will be allowed to complete. For the local executor we might try to interrupt the main thread, or we may just wait for the task to finish. As stated above, cancellation is best-effort, and doesn’t guarantee that ongoing work will be interrupted.
- Once all tasks in the flow have transitioned to a `Finished` state (`Cancelled` being a subclass of `Finished`), the flow run will transition from `Cancelling` to `Cancelled`.
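The flow-runner side of the steps above could look roughly like the following. This is a sketch under stated assumptions, not an implementation: `executor.cancel` is the tentative method name from the proposal, while `TemporaryClusterExecutor`, `handle_cancelling`, and the use of plain strings for states are hypothetical simplifications that do not touch Prefect or dask.

```python
import warnings


class Executor:
    def cancel(self):
        """Best-effort request to stop ongoing work; the default does nothing."""


class TemporaryClusterExecutor(Executor):
    """Hypothetical executor that may or may not own the cluster it uses."""

    def __init__(self, owns_cluster):
        self.owns_cluster = owns_cluster
        self.cluster_running = True

    def cancel(self):
        if self.owns_cluster:
            # Stand-in for shutting down a temporary cluster (hard stop).
            self.cluster_running = False
        else:
            warnings.warn("cannot stop ongoing work on a cluster not managed here")


# State names treated as finished in this sketch (assumption).
FINISHED = {"Success", "Failed", "Cancelled"}


def handle_cancelling(executor, task_states):
    """React to a flow run observed in Cancelling.

    `task_states` maps task name to a state name. Returns the updated
    mapping and the resulting flow run state.
    """
    executor.cancel()
    # Tasks that were never submitted go straight to Cancelled.
    updated = {
        name: "Cancelled" if state == "Pending" else state
        for name, state in task_states.items()
    }
    # The flow run leaves Cancelling only once every task is finished.
    all_done = all(state in FINISHED for state in updated.values())
    return updated, ("Cancelled" if all_done else "Cancelling")
```

In this sketch a task that is still `Running` keeps the flow run in `Cancelling`; a later call, made after that task has reached `Success`, `Failed`, or `Cancelled`, moves the flow run to `Cancelled`.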
Issue Analytics
- Created 3 years ago
- Comments: 7 (1 by maintainers)
Good question. I think that cancelling a flow run shouldn’t hard-kill the `FlowRunner` - I think we want a more graceful shutdown. IMO the flow runner should:
- [...] `Executor` to cancel the whole execution if possible
In the case of a temporary dask cluster, this would shut down the dask cluster. When the flow runner exits, the initial job (e.g. a fargate task) would exit. Hard killing the `FlowRunner` may not allow sufficient time to clean up resources, and could leave things in a bad state, so I think this should be handled at the `Executor` level.
@jcrist For Environments/Executors using AWS Fargate or Kubernetes Jobs, do you think this API would also be able to stop Flow Runs by shutting down the underlying infrastructure? It seems to be a common feature request. See: #2766