Queue support for DaskExecutor using Dask Worker Resources
Currently airflow’s DaskExecutor does not support specifying queues for tasks, due to dask’s lack of an explicit queue specification feature. However, queues can be reliably mimicked using dask resources (details here). So the setup would look something like this:
# starting dask worker that can service airflow tasks submitted with queue=queue_name_1 or queue_name_2
$ dask-worker <address> --resources "queue_name_1=inf, queue_name_2=inf"
(Unfortunately, as far as I know you need to provide a finite resource limit for the workers, so you’d have to pick an arbitrarily large limit, but I think it’s worth the minor inconvenience to get queue functionality in the dask executor.)
# airflow/executors/dask_executor.py (method of DaskExecutor)
def execute_async(
    self,
    key: TaskInstanceKey,
    command: CommandType,
    queue: Optional[str] = None,
    executor_config: Optional[Any] = None,
) -> None:
    self.validate_command(command)

    def airflow_run():
        return subprocess.check_call(command, close_fds=True)

    if not self.client:
        raise AirflowException(NOT_STARTED_MESSAGE)

    ###### change made here: map the airflow queue to a dask resource ######
    resources = None
    if queue:
        resources = {queue: 1}

    future = self.client.submit(airflow_run, pure=False, resources=resources)
    self.futures[future] = key  # type: ignore
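With that change in place, a task targets a queue simply by setting the standard queue argument on its operator. A minimal sketch (the DAG id, task id, and command are hypothetical; only queue= matters here):

# example_dag.py -- hypothetical DAG routing a task to queue_name_1
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dask_queue_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    BashOperator(
        task_id="runs_on_queue_1",
        bash_command="echo hello",
        queue="queue_name_1",  # must match a resource name given to dask-worker
    )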
Issue analytics: created 2 years ago; 8 comments (5 by maintainers).
Top GitHub Comments
@fjetter thanks for the insight. I think worker resources are the best way forward, since they let you tag your workers at creation time and then dispatch your airflow tasks based on that tag name (i.e. the queue name), without needing to keep track of explicit workers within airflow. Also, it turns out there is a way to define infinite worker resources in dask (https://github.com/dask/distributed/discussions/5010#discussioncomment-971219), so you can define the resource on the worker without having to provide an arbitrarily large limit, or having to worry about how many tasks could possibly run concurrently on your worker.
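For workers started programmatically rather than via the CLI, the same effect can be sketched as below; this assumes distributed’s Worker accepts a resources mapping (it does in current releases), with float("inf") playing the role of =inf from the linked discussion, and the scheduler address is a placeholder:

# start_worker.py -- hypothetical programmatic equivalent of
# `dask-worker <address> --resources "queue_name_1=inf"`
import asyncio

from distributed import Worker

async def main():
    async with Worker(
        "tcp://scheduler:8786",  # placeholder scheduler address
        resources={"queue_name_1": float("inf")},
    ) as worker:
        await worker.finished()  # keep serving tasks until shut down

asyncio.run(main())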
I’m not familiar enough with the queue functionality of airflow to know what the expected behaviour should be. In dask we have, broadly speaking, two to three mechanisms to limit concurrency at the task level and/or control assignment to workers:
- If you want to limit the number of assigned tasks, i.e. ensure that a task is not assigned to a worker before it is allowed to execute, resources are the way to go.
- If you want to control which workers are allowed to work on a given task, the workers keyword might be a better fit, but it does not control concurrency (other than the intrinsic limit a single worker exposes).
- If you want to ensure that only a limited number of tasks execute at once, but it is fine for them to be assigned to a worker and possibly even block a worker, we have a Semaphore which could be used.
Which one is best depends on how queuing in airflow is supposed to work; a sketch of all three mechanisms follows.
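To make the three options concrete, here is a rough sketch (the addresses, resource names, and work function are hypothetical):

# mechanisms.py -- the three dask-side mechanisms described above
from distributed import Client, Semaphore

client = Client("tcp://scheduler:8786")  # placeholder address

def work():
    return "done"

# 1. resources: only scheduled on a worker advertising queue_name_1,
#    and at most one such task per unit of the resource at a time
f1 = client.submit(work, resources={"queue_name_1": 1}, pure=False)

# 2. workers keyword: pin the task to specific worker addresses;
#    no extra concurrency limit beyond the worker's own thread pool
f2 = client.submit(work, workers=["tcp://worker-1:1234"], pure=False)

# 3. Semaphore: at most max_leases tasks run the guarded section at once,
#    even though the tasks may already be assigned to (and occupy) workers
sem = Semaphore(max_leases=2, name="airflow-queue")

def guarded_work():
    with sem:
        return work()

f3 = client.submit(guarded_work, pure=False)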