Correct Way to Setup PyTest Fixture
Currently, I have several test files being executed that all require a dask client, which is set up via a PyTest fixture:
from dask.distributed import Client, LocalCluster
import pytest


@pytest.fixture(scope="module")
def dask_client():
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)
    client = Client(cluster)
    yield client
    # teardown
    client.close()
    cluster.close()
This exists at the top of each test file, and the dask_client is then accessed with:
def test_one(dask_client):
    ...

def test_two(dask_client):
    ...

def test_three(dask_client):
    ...
Based on my reading of the PyTest documentation, it is my understanding that the dask_client is created once at the start of the execution of the test file (with scope="module"), each test within the test file is executed, and then the dask_client is torn down before the next test file (that also requires a dask_client) does the same thing.
Since the LocalCluster is initially set up with n_workers=2, threads_per_worker=2, I naively expected the maximum number of cores to be 2 and the number of threads per core to also be 2. However, according to the Activity Monitor on my 13" Macbook Pro, I see the number of threads climb to 16 for one process:
Note that I don’t have any other Python processes running. All of the Python processes shown in the image appear to be the result of tests starting/stopping and the dask_client teardown catching up. However, occasionally, simply re-running the exact same test suite multiple times produces a CancelledError:
../../miniconda3/lib/python3.7/site-packages/distributed/client.py:1885: in gather
asynchronous=asynchronous,
../../miniconda3/lib/python3.7/site-packages/distributed/client.py:767: in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
../../miniconda3/lib/python3.7/site-packages/distributed/utils.py:345: in sync
raise exc.with_traceback(tb)
../../miniconda3/lib/python3.7/site-packages/distributed/utils.py:329: in f
result[0] = yield future
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <tornado.gen.Runner object at 0x1c27e07860>
    def run(self) -> None:
        """Starts or resumes the generator, running until it reaches a
        yield point that is not ready.
        """
        if self.running or self.finished:
            return
        try:
            self.running = True
            while True:
                future = self.future
                if future is None:
                    raise Exception("No pending future")
                if not future.done():
                    return
                self.future = None
                try:
                    exc_info = None
                    try:
>                       value = future.result()
E                       concurrent.futures._base.CancelledError
../../miniconda3/lib/python3.7/site-packages/tornado/gen.py:735: CancelledError
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
============================================================ 1 failed, 175 passed in 52.44s ============================================================
Error: pytest encountered exit code 1
Based on my past experience, a CancelledError is common when running a distributed cluster and when there are mismatches in the installed Python packages. However, in this case, we are running a LocalCluster, and it appears that all of the resources are being used up and tornado is hanging. Again, the CancelledError happens sporadically when I re-run the exact same test suite multiple times.
I’m guessing that I’m doing things incorrectly or my assumptions are incorrect. Is there a correct/proper way to use a Dask LocalCluster with PyTest so that all tests are limited to only 2 cores and 2 threads per core (instead of climbing to 16 threads)?
Initially, a hacky way around this was to limit the total number of tests within each test file, which resulted in a test suite with many separate test files (each setting up and tearing down its own dask_client) but with only a handful of tests per file. This seemed to keep the number of threads from climbing. However, this solution is no longer sufficient, and I’m still seeing the same CancelledError as my test suite grows. I’ve also tried adding cluster restarts between tests, adding a few seconds of sleep after teardown, and setting up/tearing down the dask_client at the test level, but these significantly slow down the execution of the test suite.
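(For concreteness, the test-level variant amounts to a function-scoped fixture, roughly like the sketch below; this is an illustration, not the exact code from the test suite.)

import pytest
from dask.distributed import Client, LocalCluster


@pytest.fixture  # default function scope: a fresh cluster and client per test
def dask_client():
    # Full isolation between tests, but the repeated cluster startup and
    # teardown is what makes the suite so much slower.
    with LocalCluster(n_workers=2, threads_per_worker=2) as cluster:
        with Client(cluster) as client:
            yield client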
The test suite can be found here
Top GitHub Comments
@seanlaw your approach of defining a cluster fixture instead of using the client one is brilliant. I proposed adopting that in https://github.com/microsoft/LightGBM/pull/4159, which was merged today and reduced the CI time from 20 minutes to 3 minutes. The folks at xgboost are looking into adopting it as well (https://github.com/dmlc/xgboost/issues/6816).
I think this approach should be documented in https://distributed.dask.org/en/latest/develop.html#writing-tests
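For reference, a minimal sketch of that cluster-fixture pattern (the session scope and fixture names here are illustrative assumptions; see the linked PR for the actual code):

import pytest
from dask.distributed import Client, LocalCluster


@pytest.fixture(scope="session")
def cluster():
    # A single LocalCluster shared by the whole test session.
    with LocalCluster(n_workers=2, threads_per_worker=2) as cluster:
        yield cluster


@pytest.fixture
def dask_client(cluster):
    # Creating and closing a Client per test is cheap; only the cluster is reused.
    with Client(cluster) as client:
        yield client

Each test still gets its own fresh Client, but the expensive cluster startup and teardown happens only once.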
It may be worth making gen_cluster a proper pytest fixture or mark so that we play more nicely with parametrize and others. I’m not really familiar with how that’s done, though.
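For context, gen_cluster is currently used as a decorator from distributed.utils_test, roughly as in the sketch below (the test body is an illustrative assumption):

from distributed.utils_test import gen_cluster


@gen_cluster(client=True)
async def test_submit(c, s, a, b):
    # The decorator starts a scheduler (s), two workers (a, b), and a
    # client (c) in-process, and tears them all down after the test.
    future = c.submit(lambda x: x + 1, 10)
    assert await future == 11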