Max Workers for Parallel Runner

Description

I am working with a very slow database connection to migrate data to S3. The parallel runner lets me run as many connections in parallel as there are CPU cores, but no more.

Context

I would like to be able to kick it up a notch and run more workers than CPU cores while waiting on slow SQL queries.

Possible Implementation

Parallel Runner Module

As shown in the ProcessPoolExecutor docs, max_workers can be passed to the ProcessPoolExecutor constructor.
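To make the constructor behavior concrete, here is a minimal sketch using only the standard library. The helper name `accepts_max_workers` is hypothetical and exists only for illustration; it simply reports whether `ProcessPoolExecutor` accepts a given `max_workers` value.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Optional


def accepts_max_workers(max_workers: Optional[int]) -> bool:
    """Return True if ProcessPoolExecutor accepts this max_workers value.

    Hypothetical helper for illustration only.
    """
    try:
        # The constructor validates max_workers and raises ValueError
        # for values less than or equal to 0.
        pool = ProcessPoolExecutor(max_workers=max_workers)
    except ValueError:
        return False
    # No work was submitted, so just release the pool's resources.
    pool.shutdown()
    return True
```

Under these assumptions, `accepts_max_workers(None)` and `accepts_max_workers(4)` are accepted, while `accepts_max_workers(0)` is rejected.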

Change line 34 in the parallel runner module to import Optional.

- from typing import Any, Dict
+ from typing import Any, Dict, Optional

Change lines 59-68 in the parallel runner module

-    def __init__(self):
+    def __init__(self, max_workers: Optional[int] = None):
        """Instantiates the runner by creating a Manager.
+
+        Args:
+            max_workers: Maximum number of worker processes. If None,
+                it defaults to the number of processors on the machine.
        """
        self._manager = ParallelRunnerManager()
        self._manager.start()
+        self.max_workers = max_workers

Change line 170 in the parallel runner module to the following

-        with ProcessPoolExecutor() as pool:
+        with ProcessPoolExecutor(max_workers=self.max_workers) as pool:
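The three changes above can be sketched together as a small, self-contained class. This is only an illustration under stated assumptions: `SketchParallelRunner` and `slow_query` are hypothetical names, and the class is a simplified stand-in, not Kedro's actual `ParallelRunner`.

```python
import os
import time
from concurrent.futures import ProcessPoolExecutor
from typing import List, Optional


def slow_query(query_id: int) -> int:
    """Hypothetical stand-in for a slow, I/O-bound SQL query."""
    time.sleep(0.1)
    return query_id


class SketchParallelRunner:
    """Hypothetical, simplified runner; not Kedro's ParallelRunner."""

    def __init__(self, max_workers: Optional[int] = None):
        # None defers to ProcessPoolExecutor's own default worker count.
        self.max_workers = max_workers

    def run(self, query_ids: List[int]) -> List[int]:
        # Forward the stored setting to the pool, as in the proposed change.
        with ProcessPoolExecutor(max_workers=self.max_workers) as pool:
            return list(pool.map(slow_query, query_ids))


if __name__ == "__main__":
    # Oversubscribe the cores, since the work is mostly waiting on I/O.
    runner = SketchParallelRunner(max_workers=2 * (os.cpu_count() or 1))
    print(runner.run(list(range(8))))
```

Storing `max_workers` on the instance keeps the default behavior unchanged when the argument is omitted, which is why the proposal defaults it to `None`.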

kedro_cli in template

Add MAX_WORKERS_HELP to kedro_cli line 91

+
+MAX_WORKERS_HELP = """Maximum number of processes to run in parallel. If max_workers is None or not given, it will default to the number of processors on the machine. If max_workers is less than or equal to 0, then a ValueError will be raised."""

In the default template, add a flag to the CLI in [kedro_cli lines 102-147](https://github.com/quantumblacklabs/kedro/blob/861baa5ee8dd1c8ffce3ef83ae598fa38ecb55e6/kedro/template/%7B%7B%20cookiecutter.repo_name%20%7D%7D/kedro_cli.py#L102-L147)

@click.group(context_settings=CONTEXT_SETTINGS, name=__file__)
def cli():
    """Command line tools for manipulating a Kedro project."""


@cli.command()
@click.option("--from-nodes", type=str, default="", help=FROM_NODES_HELP)
@click.option("--to-nodes", type=str, default="", help=TO_NODES_HELP)
@click.option(
    "--node",
    "-n",
    "node_names",
    type=str,
    default=None,
    multiple=True,
    help=NODE_ARG_HELP,
)
@click.option(
    "--runner", "-r", type=str, default=None, multiple=False, help=RUNNER_ARG_HELP
)
@click.option("--parallel", "-p", is_flag=True, multiple=False, help=PARALLEL_ARG_HELP)
+ @click.option("--max-workers", type=int, default=None, multiple=False, help=MAX_WORKERS_HELP)
@click.option("--env", "-e", type=str, default=None, multiple=False, help=ENV_ARG_HELP)
@click.option("--tag", "-t", type=str, default=None, multiple=True, help=TAG_ARG_HELP)
- def run(tag, env, parallel, runner, node_names, to_nodes, from_nodes):
+ def run(tag, env, parallel, runner, max_workers, node_names, to_nodes, from_nodes):
    """Run the pipeline."""
    from {{cookiecutter.python_package}}.run import main
    from_nodes = [n for n in from_nodes.split(",") if n]
    to_nodes = [n for n in to_nodes.split(",") if n]

    if parallel and runner:
        raise KedroCliError(
            "Both --parallel and --runner options cannot be used together. "
            "Please use either --parallel or --runner."
        )
+
+    if runner and max_workers is not None:
+        raise KedroCliError(
+            "Both --runner and --max-workers options cannot be used together. "
+            "Please use either --parallel with --max-workers or --runner."
+        )

    if parallel:
        runner = "ParallelRunner"
    runner_class = load_obj(runner, "kedro.runner") if runner else SequentialRunner
+
+    runner_kwargs = {}
+    if parallel and max_workers is not None:
+        runner_kwargs["max_workers"] = max_workers

    main(
        tags=tag,
        env=env,
-        runner=runner_class(),
+        runner=runner_class(**runner_kwargs),  # I am not familiar with load_obj, so I am not completely sure this is correct
        node_names=node_names,
        from_nodes=from_nodes,
        to_nodes=to_nodes,
    )
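The option handling proposed above can be isolated into a small helper so it can be checked without running the CLI. This is a sketch under stated assumptions: `build_runner_kwargs` and `CliUsageError` are hypothetical names (the latter stands in for `KedroCliError`), and the logic mirrors the proposal rather than any released Kedro code.

```python
from typing import Any, Dict, Optional


class CliUsageError(Exception):
    """Hypothetical stand-in for KedroCliError."""


def build_runner_kwargs(
    parallel: bool, runner: Optional[str], max_workers: Optional[int]
) -> Dict[str, Any]:
    """Validate the option combination and build kwargs for the runner class."""
    if parallel and runner:
        raise CliUsageError(
            "Both --parallel and --runner options cannot be used together. "
            "Please use either --parallel or --runner."
        )
    if runner and max_workers is not None:
        raise CliUsageError(
            "Both --runner and --max-workers options cannot be used together. "
            "Please use either --parallel with --max-workers or --runner."
        )
    runner_kwargs: Dict[str, Any] = {}
    # Only the parallel runner receives max_workers.
    if parallel and max_workers is not None:
        runner_kwargs["max_workers"] = max_workers
    return runner_kwargs
```

Note that, as in the proposal, passing `max_workers` without `parallel` or `runner` is silently ignored and yields an empty dict; whether that case should raise instead is a design choice left open here.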

Docs

04_create_pipelines.md line 727

- * `ParallelRunner` - runs your nodes in parallel; independent nodes are able to run at the same time, allowing you to take advantage of multiple CPU cores.
+ * `ParallelRunner` - runs your nodes in parallel; independent nodes are able to run at the same time, allowing you to take advantage of multiple CPU cores as set by `max_workers`.

05_nodes_and_pipelines.md line 526

+ > *Note:* the parallel runner can be called with `--max-workers` to set the maximum number of worker processes.

04_create_pipelines.md line 737

+ > *Note:* the parallel runner can be called with `--max-workers` to set the maximum number of worker processes.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
921kiyo commented, Oct 7, 2019

@WaylonWalker We have added this in https://github.com/quantumblacklabs/kedro/commit/4dc48e7b17104356934b5a7cb91c760a739fe44e (and we are releasing the new version soon). I am closing this issue but let us know if you have any questions 😃

1 reaction
WaylonWalker commented, Sep 11, 2019

@Flid Thanks for the heads up. I’ll make sure to revert that part. Our team will be using a custom cookiecutter based on kedro new with a few customizations for us. I can make the change there.

I was able to make the change easily, but testing it has proven tricky: my current pipeline is on 14.3, and several things seemed broken when I tried to use my branch of kedro on it. Nothing major; I just need a bit of time to make the upgrade and test it out.
