Use Case

A lot of the work we do is crawling / spidering datasets and services, so we usually have some kind of tree to walk. A simple example is a website that exposes a REST service mimicking a directory structure (files in folders in folders, recursively), but this extends to any kind of source that contains embedded objects.

If you want to process all files, you need to scan the root folder, which returns a list of sub-folders and files. The files can be extracted and processed, and each sub-folder needs to be scanned recursively in turn.
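The scan described above can be sketched in plain Python. This is only an illustration, not Prefect code: `TREE`, `scan_folder`, and `collect_files` are hypothetical names, and the in-memory dictionary stands in for the remote service.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the remote service:
# folder path -> (sub-folders, files).
TREE: Dict[str, Tuple[List[str], List[str]]] = {
    "/": (["/a", "/b"], ["/readme.txt"]),
    "/a": (["/a/c"], ["/a/1.csv"]),
    "/a/c": ([], ["/a/c/2.csv"]),
    "/b": ([], []),
}


def scan_folder(folder: str) -> Tuple[List[str], List[str]]:
    """Return (sub_folders, files) for one folder.

    A real version would call the service instead of reading TREE.
    """
    return TREE[folder]


def collect_files(folder: str) -> List[str]:
    """Depth-first walk: keep this folder's files, then recurse into sub-folders."""
    sub_folders, files = scan_folder(folder)
    found = list(files)
    for sub_folder in sub_folders:
        found.extend(collect_files(sub_folder))
    return found
```

Calling `collect_files("/")` walks the whole tree in a single call, which is exactly the shape that is hard to split into separately retried tasks.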

Solution

I imagine some ideal solution where you can call a function recursively to “fan-out” over a dataset / service, so something like the following:

process_file_task.map(file=get_files_task.recurse(endpoint=static("https://bla.."), folder="/"))

I imagine get_files_task would return (folders: List, files: List). The .recurse call would treat the endpoint parameter as non-recursive (similar to unmapped) and see one dynamic parameter, folder. It would therefore map the first output (folders) back to that input and take the remaining output (files) as the actual output, if that makes sense.
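To make the intended semantics concrete, here is a minimal sketch of such a recurse helper in plain Python. It is not an existing Prefect API: `recurse`, `get_files`, `LISTING`, and the placeholder endpoint are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical listing that a real get_files would fetch from the endpoint.
LISTING: Dict[str, Tuple[List[str], List[str]]] = {
    "/": (["/a", "/b"], ["/readme.txt"]),
    "/a": (["/a/c"], ["/a/1.csv"]),
    "/a/c": ([], ["/a/c/2.csv"]),
}


def get_files(folder: str, endpoint: str) -> Tuple[List[str], List[str]]:
    """Return (folders, files); endpoint is the static, non-recursive argument."""
    return LISTING.get(folder, ([], []))


def recurse(
    fn: Callable[..., Tuple[List[Any], List[Any]]],
    seed: Any,
    **static_kwargs: Any,
) -> List[Any]:
    """Feed fn's first output back in as its dynamic input; collect its second output."""
    pending = [seed]
    results: List[Any] = []
    while pending:
        current = pending.pop()
        next_inputs, outputs = fn(current, **static_kwargs)
        pending.extend(next_inputs)  # folders become new dynamic inputs
        results.extend(outputs)      # files are the actual output
    return results


# Placeholder endpoint, not a real service.
files = recurse(get_files, "/", endpoint="https://example.invalid")
```

The static keyword arguments are passed unchanged to every call, which mirrors how unmapped arguments behave in a mapped task.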

I’m not sure how the DFE (depth-first execution) described in #2041 will actually work. If it works somewhat as I imagine, a dream scenario would be that the .recurse in the example above queues recursive get_files_task runs concurrently (so one task can create multiple simultaneously running copies of itself), and the .map processes results as each recursive task returns.
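The concurrent behaviour I have in mind can be approximated with the standard library alone. The sketch below is an assumption about the desired behaviour, not a description of how Prefect schedules tasks; `fan_out`, `scan_folder`, and `TREE` are hypothetical names.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-in for the remote service.
TREE: Dict[str, Tuple[List[str], List[str]]] = {
    "/": (["/a", "/b"], ["/readme.txt"]),
    "/a": (["/a/c"], ["/a/1.csv"]),
    "/a/c": ([], ["/a/c/2.csv"]),
    "/b": ([], ["/b/3.csv"]),
}


def scan_folder(folder: str) -> Tuple[List[str], List[str]]:
    """Return (sub_folders, files) for one folder."""
    return TREE[folder]


def fan_out(
    scan: Callable[[str], Tuple[List[str], List[str]]],
    process: Callable[[str], Any],
    root: str,
    max_workers: int = 4,
) -> List[Any]:
    """Run scans concurrently and process files as soon as each scan returns."""
    processed: List[Any] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(scan, root)}
        while pending:
            # Wake up as soon as any one scan has finished.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                sub_folders, files = future.result()
                # Each finished scan queues scans for the folders it found.
                pending |= {pool.submit(scan, sub) for sub in sub_folders}
                # Processing runs in the calling thread in this sketch.
                processed.extend(process(name) for name in files)
    return processed


results = fan_out(scan_folder, str.upper, "/")
```

Because scans finish in a nondeterministic order, the order of `results` is not guaranteed; only the set of processed files is.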

Alternatives

We are currently using 2 approaches to deal with situations such as the one above:

  1. If the data gathered by the recursive function is not huge (that is, we can hold it all in the memory of one worker instance), we do a recursive collect inside the task. Unfortunately, this is error-prone (we can’t use the Prefect retry mechanisms) and leads to large, long-running tasks, because the task can’t run fetches concurrently and can’t spread over multiple workers.

  2. If the data returned by the recursive function is large or we need quick ingest, we use a wrapper we built around flow.run. It starts flows in a concurrent future and monitors each flow for data that can be passed to a new flow. It looks a bit like this:

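        import time
        from concurrent.futures import Future, ThreadPoolExecutor
        from typing import Dict, List

        # Fragment from a method of the wrapper class: self._flow, self._executor,
        # self._logger and parameters are defined elsewhere.
        thread_pool_executor = ThreadPoolExecutor(10)
        # Start the initial flow run in a background thread.
        futures: List[Future] = [
            thread_pool_executor.submit(
                self._flow.run, parameters=parameters, executor=self._executor
            )
        ]
        handled_new_parameters: List[Dict] = []
        # Cap on how many follow-up flow runs may be started.
        recursion_countdown = 3
        # Poll until every submitted run has finished. Checking done() rather than
        # running() avoids exiting early while a run is still queued.
        while any(not future.done() for future in futures):
            for task in self._flow.tasks:
                try:
                    # Tasks signal a follow-up run by setting this attribute.
                    new_parameters = task.start_new_flow_parameters
                    if new_parameters not in handled_new_parameters:
                        self._logger.debug(f"Found start flow parameters: {new_parameters}")
                        handled_new_parameters.append(new_parameters)
                        if recursion_countdown > 0:
                            recursion_countdown -= 1
                            combined_parameters = {
                                **parameters,
                                "recursive_parameters": new_parameters,
                            }
                            futures.append(
                                thread_pool_executor.submit(
                                    self._flow.run,
                                    parameters=combined_parameters,
                                    executor=self._executor,
                                )
                            )
                except AttributeError:
                    # This task has not requested a new flow run.
                    pass
            time.sleep(0.01)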

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

2 reactions
lauralorenz commented, Jun 16, 2020

Hi @jacques- – FWIW now that depth first execution for mapping is through in the forthcoming 0.12.0, we are reevaluating priority of other features that were effectively blocked by that refactor, and flat_map is one of them and is comparatively pretty high. We don’t have a specific timeline but just wanted to give you the heads up 😃

0 reactions
github-actions[bot] commented, Dec 3, 2022

This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add feel free to re-open it.
