Max Workers for Parallel Runner

Description

I am working with a very slow database connection to migrate data to S3. The parallel runner only lets me run as many connections in parallel as there are CPU cores.

Context

I would like to be able to kick it up a notch and run more workers than cores while waiting on slow SQL queries.

Possible Implementation

Parallel Runner Module

As shown in the ProcessPoolExecutor docs, `max_workers` can be passed to the ProcessPoolExecutor constructor.
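To make the `max_workers` behaviour concrete, here is a minimal sketch using only the standard library. The helper names `effective_workers` and `pool_accepts` are hypothetical and exist only for illustration; the default shown is an approximation of what the executor does (newer Python versions may count only the CPUs available to the process, and Windows caps the value at 61).

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import Optional


def effective_workers(max_workers: Optional[int] = None) -> int:
    """Approximate the pool size ProcessPoolExecutor would use (hypothetical helper)."""
    if max_workers is None:
        # Documented default: the number of processors on the machine.
        return os.cpu_count() or 1
    if max_workers <= 0:
        # Same condition under which ProcessPoolExecutor raises ValueError.
        raise ValueError("max_workers must be greater than 0")
    return max_workers


def pool_accepts(max_workers: Optional[int]) -> bool:
    """Return True if ProcessPoolExecutor can be constructed with this value."""
    try:
        pool = ProcessPoolExecutor(max_workers=max_workers)
    except ValueError:
        return False
    # No work was submitted, so shutting down immediately is cheap.
    pool.shutdown()
    return True
```

For I/O-bound work such as slow SQL queries, a value larger than the core count (for example 16 on a 4-core machine) is accepted, which is the behaviour this issue asks to expose.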

Change line 34 in the parallel runner module to import `Optional`.

- from typing import Any, Dict
+ from typing import Any, Dict, Optional

Change lines 59-68 in the parallel runner module

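-    def __init__(self):
+    def __init__(self, max_workers: Optional[int] = None):
         """Instantiates the runner by creating a Manager.
+
+        Args:
+            max_workers: Number of worker processes to spawn. If None, ProcessPoolExecutor falls back to the number of processors on the machine.
         """
         self._manager = ParallelRunnerManager()
         self._manager.start()
+        self.max_workers = max_workers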

Change line 170 in the parallel runner module to the following

-        with ProcessPoolExecutor() as pool:
+        with ProcessPoolExecutor(max_workers=self.max_workers) as pool:
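Put together, the three changes above amount to storing the value in the constructor and forwarding it when the pool is created. The following is a simplified, self-contained sketch of that pattern. `MiniParallelRunner` is a hypothetical stand-in, not Kedro's actual `ParallelRunner`, and it leaves out the manager and pipeline logic.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Any, Callable, Iterable, List, Optional


class MiniParallelRunner:
    """Hypothetical, simplified stand-in for a parallel runner (not Kedro's class)."""

    def __init__(self, max_workers: Optional[int] = None):
        # None lets ProcessPoolExecutor pick its default pool size.
        self.max_workers = max_workers

    def run(self, func: Callable[[Any], Any], items: Iterable[Any]) -> List[Any]:
        """Apply a picklable function to each item using a process pool."""
        # The stored value is forwarded unchanged, so an invalid value
        # (zero or negative) surfaces here as a ValueError.
        with ProcessPoolExecutor(max_workers=self.max_workers) as pool:
            return list(pool.map(func, items))
```

A call such as `MiniParallelRunner(max_workers=16).run(abs, [-1, -2])` would then use up to 16 worker processes regardless of the core count. When running it as a script, the call should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.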

kedro_cli in template

Add `MAX_WORKERS_HELP` at kedro_cli line 91

+
+MAX_WORKERS_HELP = """Maximum number of processes to run in parallel. If max_workers is None or not given, it will default to the number of processors on the machine. If max_workers is less than or equal to 0, then a ValueError will be raised."""

In the default template, add a flag to the CLI in [kedro_cli lines 102-147](https://github.com/quantumblacklabs/kedro/blob/861baa5ee8dd1c8ffce3ef83ae598fa38ecb55e6/kedro/template/%7B%7B%20cookiecutter.repo_name%20%7D%7D/kedro_cli.py#L102-L147)

``` diff
@click.group(context_settings=CONTEXT_SETTINGS, name=__file__)
def cli():
    """Command line tools for manipulating a Kedro project."""


@cli.command()
@click.option("--from-nodes", type=str, default="", help=FROM_NODES_HELP)
@click.option("--to-nodes", type=str, default="", help=TO_NODES_HELP)
@click.option(
    "--node",
    "-n",
    "node_names",
    type=str,
    default=None,
    multiple=True,
    help=NODE_ARG_HELP,
)
@click.option(
    "--runner", "-r", type=str, default=None, multiple=False, help=RUNNER_ARG_HELP
)
@click.option("--parallel", "-p", is_flag=True, multiple=False, help=PARALLEL_ARG_HELP)
+ @click.option("--max-workers", type=int, default=None, multiple=False, help=MAX_WORKERS_HELP)
@click.option("--env", "-e", type=str, default=None, multiple=False, help=ENV_ARG_HELP)
@click.option("--tag", "-t", type=str, default=None, multiple=True, help=TAG_ARG_HELP)
- def run(tag, env, parallel, runner, node_names, to_nodes, from_nodes):
+ def run(tag, env, parallel, runner, max_workers, node_names, to_nodes, from_nodes):
    """Run the pipeline."""
    from {{cookiecutter.python_package}}.run import main
    from_nodes = [n for n in from_nodes.split(",") if n]
    to_nodes = [n for n in to_nodes.split(",") if n]

    if parallel and runner:
        raise KedroCliError(
            "Both --parallel and --runner options cannot be used together. "
            "Please use either --parallel or --runner."
        )
+
+   if runner and max_workers is not None:
+       raise KedroCliError(
+           "Both --runner and --max-workers options cannot be used together. "
+           "Please use either --parallel with --max-workers or --runner."
+       )

    if parallel:
        runner = "ParallelRunner"
    runner_class = load_obj(runner, "kedro.runner") if runner else SequentialRunner
+
+   runner_kwargs = {}
+   if parallel and max_workers is not None:
+       runner_kwargs["max_workers"] = max_workers

    main(
        tags=tag,
        env=env,
-       runner=runner_class(),
+       runner=runner_class(**runner_kwargs),  # I am not familiar with load_obj, so I am not completely sure this is correct
        node_names=node_names,
        from_nodes=from_nodes,
        to_nodes=to_nodes,
    )
```
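The option handling proposed for `run` can be checked in isolation, without click or Kedro installed. This is a rough sketch under stated assumptions: `build_runner_kwargs` is a hypothetical helper, and `CliUsageError` is a stand-in for `KedroCliError`.

```python
from typing import Any, Dict, Optional


class CliUsageError(Exception):
    """Stand-in for KedroCliError so the sketch runs without Kedro installed."""


def build_runner_kwargs(
    parallel: bool, runner: Optional[str], max_workers: Optional[int]
) -> Dict[str, Any]:
    """Validate the option combination and return kwargs for the runner class."""
    if parallel and runner:
        raise CliUsageError(
            "Both --parallel and --runner options cannot be used together. "
            "Please use either --parallel or --runner."
        )
    if runner and max_workers is not None:
        raise CliUsageError(
            "Both --runner and --max-workers options cannot be used together. "
            "Please use either --parallel with --max-workers or --runner."
        )
    runner_kwargs: Dict[str, Any] = {}
    # Only the parallel runner understands max_workers.
    if parallel and max_workers is not None:
        runner_kwargs["max_workers"] = max_workers
    return runner_kwargs
```

One thing this makes visible: with the logic as proposed, passing `--max-workers` without `--parallel` (and without `--runner`) is silently ignored rather than rejected. Whether that should raise an error instead is a design choice for the maintainers.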

Docs

04_create_pipelines.md line 727

- * `ParallelRunner` - runs your nodes in parallel; independent nodes are able to run at the same time, allowing you to take advantage of multiple CPU cores.
+ * `ParallelRunner` - runs your nodes in parallel; independent nodes are able to run at the same time, allowing you to take advantage of multiple CPU cores as set by `max_workers`.

05_nodes_and_pipelines.md line 526

+ > *Note:* The parallel runner can be called with `--max-workers` to set the maximum number of worker processes.

04_create_pipelines.md line 737

+ > *Note:* The parallel runner can be called with `--max-workers` to set the maximum number of worker processes.

Issue Analytics

Created 4 years ago
Reactions: 2
Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction

921kiyo commented, Oct 7, 2019

@WaylonWalker We have added this in https://github.com/quantumblacklabs/kedro/commit/4dc48e7b17104356934b5a7cb91c760a739fe44e (and we are releasing the new version soon). I am closing this issue but let us know if you have any questions 😃

1 reaction

WaylonWalker commented, Sep 11, 2019

@Flid Thanks for the heads up. I’ll make sure to revert that part. Our team will be using a custom cookiecutter based on kedro new with a few customizations for us. I can make the change there.

I was able to make the change easily, but testing it has proven tricky, as my current pipeline is on 14.3 and several things seemed broken when I tried to use my branch of kedro on it. Nothing major; I just need a bit of time to make the upgrade and test it out.