Recursive tasks
Use Case
A lot of the work we do is crawling / spidering datasets and services, so we have some kind of tree to walk. A simple example might be a website that exposes a REST service mimicking a directory structure (files in folders in folders, recursively), but this really extends to any kind of source that contains embedded objects.
If you want to process all files, you need to scan the root folder, which returns a list of sub-folders and files. The files can then be extracted and processed, and each sub-folder needs to be recursively scanned.
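For concreteness, a minimal sketch of that walk in plain Python, assuming a hypothetical service where listing a folder returns a `{"folders": [...], "files": [...]}` payload:

```python
import requests


def scan(endpoint: str, folder: str) -> list:
    # Hypothetical API: GET {endpoint}?folder=... returns the folder listing.
    listing = requests.get(endpoint, params={"folder": folder}).json()
    files = list(listing["files"])
    for sub in listing["folders"]:
        # Recurse into each sub-folder and accumulate its files.
        files.extend(scan(endpoint, sub))
    return files
```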
Solution
I imagine some ideal solution where you can call a function recursively to “fan-out” over a dataset / service, so something like the following:
```python
process_file_task.map(
    file=get_files_task.recurse(endpoint=static("https://bla.."), folder="/")
)
```
So then I imagine `get_files_task` would return `(folders: List, files: List)`, and `.recurse` would treat the `endpoint` parameter as non-recursive (similar to `unmapped`), see the one remaining dynamic parameter (`folder`), map the first output back onto that input, and pass the other output through as the actual output, if that makes sense.
I’m not sure how the depth-first execution (DFE) described in #2041 will actually work, but if it works somewhat as I imagine, then a dream scenario might be that the `.recurse` in the example above queues recursive `get_files_task` runs concurrently (so one task can create multiple simultaneously running copies of itself) and the `.map` processes results as each recursive task returns.
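Prefect has no `.recurse` today, so purely as a sketch of the semantics I have in mind, here it is in plain `concurrent.futures` terms (`get_files(endpoint, folder) -> (folders, files)` and `process_file(file)` are hypothetical stand-ins for the tasks above):

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def recurse_and_map(get_files, process_file, endpoint: str, root: str = "/"):
    with ThreadPoolExecutor(max_workers=10) as pool:
        pending = {pool.submit(get_files, endpoint, root)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                folders, files = future.result()
                # "recurse": each result spawns concurrent copies of the scan.
                pending |= {pool.submit(get_files, endpoint, f) for f in folders}
                # ".map": process files as each recursive scan returns.
                for file in files:
                    pool.submit(process_file, file)
    # Exiting the pool context waits for any outstanding process_file calls.
```

Each completed `get_files` call both feeds `process_file` and schedules more copies of itself, which is the fan-out I would hope `.recurse` plus `.map` could provide natively, with per-call retries and observability.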
Alternatives
We currently use two approaches to deal with situations like the one above:

- If the data gathered by the recursive function is not huge (i.e. we can hold all of it in one worker instance's memory), we do a recursive collect inside the task (a minimal sketch follows the wrapper code below). Unfortunately, this is error-prone (we can't use Prefect's retry mechanisms) and leads to large, long-running tasks, since the task can't run fetches concurrently and can't spread over multiple workers.
- If the data returned by the recursive function is large, or we need fast ingest, we use a wrapper around `flow.run` that starts flows in a concurrent future and monitors each flow for data that can be passed to a new flow. It looks a bit like this:
```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List

thread_pool_executor = ThreadPoolExecutor(10)
futures: List[Future] = [
    thread_pool_executor.submit(
        self._flow.run, parameters=parameters, executor=self._executor
    )
]
handled_new_parameters: List[Dict] = []
recursion_countdown = 3

while any(future.running() for future in futures):
    for task in self._flow.tasks:
        try:
            # Tasks that want to fan out expose this attribute; others raise.
            new_parameters = task.start_new_flow_parameters
            if new_parameters not in handled_new_parameters:
                self._logger.debug(f"Found start flow parameters: {new_parameters}")
                handled_new_parameters.append(new_parameters)
                if recursion_countdown > 0:
                    recursion_countdown -= 1
                    combined_parameters = {
                        **parameters,
                        "recursive_parameters": new_parameters,
                    }
                    # Start a new flow run for the newly discovered parameters.
                    futures.append(
                        thread_pool_executor.submit(
                            self._flow.run,
                            parameters=combined_parameters,
                            executor=self._executor,
                        )
                    )
        except AttributeError:
            pass
    time.sleep(0.01)
```
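For contrast, the first approach above (a recursive collect inside a single task) looks roughly like this — a minimal sketch assuming Prefect 1.x's `@task` and the hypothetical listing endpoint from earlier:

```python
import requests
from prefect import task


@task
def collect_all_files(endpoint: str, folder: str = "/") -> list:
    # Hypothetical response shape: {"folders": [...], "files": [...]}.
    listing = requests.get(endpoint, params={"folder": folder}).json()
    files = list(listing["files"])
    pending = list(listing["folders"])
    while pending:  # walk the entire tree inside this one task
        listing = requests.get(endpoint, params={"folder": pending.pop()}).json()
        files.extend(listing["files"])
        pending.extend(listing["folders"])
    # Everything sits in one worker's memory, the fetches run serially,
    # and a retry restarts the whole walk rather than one failed request.
    return files
```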
Hi @jacques- – FWIW, now that depth-first execution for mapping has landed in the forthcoming 0.12.0, we are reevaluating the priority of other features that were effectively blocked by that refactor; `flat_map` is one of them and ranks comparatively high. We don’t have a specific timeline, but just wanted to give you the heads up 😃

This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add, feel free to re-open it.
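For reference, flat mapping did ship in later Prefect releases. A minimal sketch of the pattern, assuming Prefect 1.x's `flatten` annotation (task names here are illustrative):

```python
from prefect import Flow, flatten, task


@task
def list_children(folder: str) -> list:
    # Hypothetical: each folder yields a list of sub-folders.
    return [f"{folder}/a", f"{folder}/b"]


@task
def process(item: str) -> str:
    return item.upper()


with Flow("flat-map-sketch") as flow:
    children = list_children.map(["/x", "/y"])  # produces a list of lists
    # flatten() un-nests one level so process can map over every child.
    processed = process.map(flatten(children))
```

Note that `flatten` removes a single level of nesting per call, so it covers one level of fan-out rather than the unbounded recursion asked for in this issue.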