Recursive tasks
Use Case
A lot of the work we do is crawling / spidering datasets and services, so we have some kind of tree to walk. A simple example might be a website that exposes a REST service mimicking a directory structure (files in folders in folders, recursively), but this really extends to any kind of source that contains embedded objects.
If you want to process all files, you need to scan the root folder, which returns a list of sub-folders and files. The files can then be extracted and processed, and each sub-folder needs to be recursively scanned.
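For concreteness, a minimal sketch of that walk in plain Python, assuming a hypothetical service where listing a folder returns a `{"folders": [...], "files": [...]}` payload:

```python
import requests


def scan(endpoint: str, folder: str) -> list:
    # Hypothetical API: GET {endpoint}?folder=... returns the folder listing.
    listing = requests.get(endpoint, params={"folder": folder}).json()
    files = list(listing["files"])
    for sub in listing["folders"]:
        # Recurse into each sub-folder and accumulate its files.
        files.extend(scan(endpoint, sub))
    return files
```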
Solution
I imagine some ideal solution where you can call a function recursively to “fan-out” over a dataset / service, so something like the following:
```python
process_file_task.map(
    file=get_files_task.recurse(endpoint=static("https://bla.."), folder="/")
)
```
So then I imagine `get_files_task` would return `(folders: List, files: List)`, and `.recurse` would treat the `endpoint` parameter as non-recursive (similar to `unmapped`), see the one remaining dynamic parameter (`folder`), map the first output back onto that input, and pass the other output through as the actual output, if that makes sense.
I’m not sure how the depth-first execution (DFE) described in #2041 will actually work, but if it works somewhat as I imagine, then a dream scenario might be that the `.recurse` in the example above queues recursive `get_files_task` runs concurrently (so one task can create multiple simultaneously running copies of itself) and the `.map` processes results as each recursive task returns.
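Prefect has no `.recurse` today, so purely as a sketch of the semantics I have in mind, here it is in plain `concurrent.futures` terms (`get_files(endpoint, folder) -> (folders, files)` and `process_file(file)` are hypothetical stand-ins for the tasks above):

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def recurse_and_map(get_files, process_file, endpoint: str, root: str = "/"):
    with ThreadPoolExecutor(max_workers=10) as pool:
        pending = {pool.submit(get_files, endpoint, root)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                folders, files = future.result()
                # "recurse": each result spawns concurrent copies of the scan.
                pending |= {pool.submit(get_files, endpoint, f) for f in folders}
                # ".map": process files as each recursive scan returns.
                for file in files:
                    pool.submit(process_file, file)
    # Exiting the pool context waits for any outstanding process_file calls.
```

Each completed `get_files` call both feeds `process_file` and schedules more copies of itself, which is the fan-out I would hope `.recurse` plus `.map` could provide natively, with per-call retries and observability.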
Alternatives
We currently use two approaches to deal with situations like the one above:

- If the data gathered by the recursive function is not huge (i.e. we can hold all of it in one worker instance's memory), we do a recursive collect inside the task (a minimal sketch follows the wrapper code below). Unfortunately, this is error-prone (we can't use Prefect's retry mechanisms) and leads to large, long-running tasks, since the task can't run fetches concurrently and can't spread over multiple workers.
- If the data returned by the recursive function is large, or we need fast ingest, we use a wrapper around `flow.run` that starts flows in a concurrent future and monitors each flow for data that can be passed to a new flow. It looks a bit like this:
```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List

thread_pool_executor = ThreadPoolExecutor(10)
futures: List[Future] = [
    thread_pool_executor.submit(
        self._flow.run, parameters=parameters, executor=self._executor
    )
]
handled_new_parameters: List[Dict] = []
recursion_countdown = 3

while any(future.running() for future in futures):
    for task in self._flow.tasks:
        try:
            # Tasks that want to fan out expose this attribute; others raise.
            new_parameters = task.start_new_flow_parameters
            if new_parameters not in handled_new_parameters:
                self._logger.debug(f"Found start flow parameters: {new_parameters}")
                handled_new_parameters.append(new_parameters)
                if recursion_countdown > 0:
                    recursion_countdown -= 1
                    combined_parameters = {
                        **parameters,
                        "recursive_parameters": new_parameters,
                    }
                    # Start a new flow run for the newly discovered parameters.
                    futures.append(
                        thread_pool_executor.submit(
                            self._flow.run,
                            parameters=combined_parameters,
                            executor=self._executor,
                        )
                    )
        except AttributeError:
            pass
    time.sleep(0.01)
```
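For contrast, the first approach above (a recursive collect inside a single task) looks roughly like this — a minimal sketch assuming Prefect 1.x's `@task` and the hypothetical listing endpoint from earlier:

```python
import requests
from prefect import task


@task
def collect_all_files(endpoint: str, folder: str = "/") -> list:
    # Hypothetical response shape: {"folders": [...], "files": [...]}.
    listing = requests.get(endpoint, params={"folder": folder}).json()
    files = list(listing["files"])
    pending = list(listing["folders"])
    while pending:  # walk the entire tree inside this one task
        listing = requests.get(endpoint, params={"folder": pending.pop()}).json()
        files.extend(listing["files"])
        pending.extend(listing["folders"])
    # Everything sits in one worker's memory, the fetches run serially,
    # and a retry restarts the whole walk rather than one failed request.
    return files
```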
Hi @jacques- – FWIW, now that depth-first execution for mapping has landed in the forthcoming 0.12.0, we are reevaluating the priority of other features that were effectively blocked by that refactor; `flat_map` is one of them and ranks comparatively high. We don’t have a specific timeline, but just wanted to give you the heads up 😃

This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add, feel free to re-open it.
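For reference, flat mapping did ship in later Prefect releases. A minimal sketch of the pattern, assuming Prefect 1.x's `flatten` annotation (task names here are illustrative):

```python
from prefect import Flow, flatten, task


@task
def list_children(folder: str) -> list:
    # Hypothetical: each folder yields a list of sub-folders.
    return [f"{folder}/a", f"{folder}/b"]


@task
def process(item: str) -> str:
    return item.upper()


with Flow("flat-map-sketch") as flow:
    children = list_children.map(["/x", "/y"])  # produces a list of lists
    # flatten() un-nests one level so process can map over every child.
    processed = process.map(flatten(children))
```

Note that `flatten` removes a single level of nesting per call, so it covers one level of fan-out rather than the unbounded recursion asked for in this issue.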