Correct Way to Setup PyTest Fixture
Currently, I have several test files being executed that all require a dask client, which is set up via a PyTest fixture:
from dask.distributed import Client, LocalCluster
import pytest


@pytest.fixture(scope="module")
def dask_client():
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)
    client = Client(cluster)
    yield client
    # teardown
    client.close()
    cluster.close()
This exists at the top of each test file, and the dask_client is then accessed with:
def test_one(dask_client):
    ...

def test_two(dask_client):
    ...

def test_three(dask_client):
    ...
Based on my reading of the PyTest documentation, it is my understanding that the dask_client is created once at the start of the execution of the test file (with scope="module"), each test within the test file is executed, and then the dask_client is torn down before the next test file (that also requires a dask_client) does the same thing.
Since the LocalCluster is initially set up with n_workers=2, threads_per_worker=2, I naively expected the maximum number of cores to be 2 and the number of threads per core to also be 2. However, according to the Activity Monitor on my 13" Macbook Pro, I see the number of threads climb to 16 for one process:
Note that I don’t have any other Python processes running. All of the Python processes shown in the image appear to be the result of tests starting/stopping and the dask_client teardown catching up. However, occasionally, simply re-running the exact same test suite multiple times produces a CancelledError:
../../miniconda3/lib/python3.7/site-packages/distributed/client.py:1885: in gather
asynchronous=asynchronous,
../../miniconda3/lib/python3.7/site-packages/distributed/client.py:767: in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
../../miniconda3/lib/python3.7/site-packages/distributed/utils.py:345: in sync
raise exc.with_traceback(tb)
../../miniconda3/lib/python3.7/site-packages/distributed/utils.py:329: in f
result[0] = yield future
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <tornado.gen.Runner object at 0x1c27e07860>
    def run(self) -> None:
        """Starts or resumes the generator, running until it reaches a
        yield point that is not ready.
        """
        if self.running or self.finished:
            return
        try:
            self.running = True
            while True:
                future = self.future
                if future is None:
                    raise Exception("No pending future")
                if not future.done():
                    return
                self.future = None
                try:
                    exc_info = None
                    try:
>                       value = future.result()
E                       concurrent.futures._base.CancelledError
../../miniconda3/lib/python3.7/site-packages/tornado/gen.py:735: CancelledError
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
============================================================ 1 failed, 175 passed in 52.44s ============================================================
Error: pytest encountered exit code 1
Based on my past experience, a CancelledError is common when running a distributed cluster and when there are mismatches in the installed Python packages. However, in this case, we are running a LocalCluster, and it appears that all of the resources are being used up and tornado is hanging. Again, the CancelledError happens sporadically when I re-run the exact same test suite multiple times.
I’m guessing that I’m doing things incorrectly or my assumptions are incorrect. Is there a correct/proper way to use a Dask LocalCluster with PyTest so that all tests are limited to only 2 cores and 2 threads per core (instead of climbing to 16 threads)?
Initially, a hacky way around this was to limit the total number of tests within each test file, which resulted in a test suite with many separate test files (each setting up and tearing down its own dask_client) but with only a handful of tests per file. This seemed to keep the number of threads from climbing. However, this solution is no longer sufficient, and I’m still seeing the same CancelledError as my test suite grows. I’ve also tried adding cluster restarts between tests, adding a few seconds of sleep after teardown, and setting up/tearing down the dask_client at the test level, but these significantly slow down the execution of the test suite.
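(For concreteness, the test-level variant amounts to a function-scoped fixture, roughly like the sketch below; this is an illustration, not the exact code from the test suite.)

import pytest
from dask.distributed import Client, LocalCluster


@pytest.fixture  # default function scope: a fresh cluster and client per test
def dask_client():
    # Full isolation between tests, but the repeated cluster startup and
    # teardown is what makes the suite so much slower.
    with LocalCluster(n_workers=2, threads_per_worker=2) as cluster:
        with Client(cluster) as client:
            yield client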
The test suite can be found here
Top GitHub Comments
@seanlaw your approach of defining a cluster fixture instead of using the client one is brilliant. I proposed adopting that in https://github.com/microsoft/LightGBM/pull/4159, which was merged today and reduced the CI time from 20 minutes to 3 minutes. The folks at xgboost are looking into adopting it as well (https://github.com/dmlc/xgboost/issues/6816).
I think this approach should be documented in https://distributed.dask.org/en/latest/develop.html#writing-tests
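For reference, a minimal sketch of that cluster-fixture pattern (the session scope and fixture names here are illustrative assumptions; see the linked PR for the actual code):

import pytest
from dask.distributed import Client, LocalCluster


@pytest.fixture(scope="session")
def cluster():
    # A single LocalCluster shared by the whole test session.
    with LocalCluster(n_workers=2, threads_per_worker=2) as cluster:
        yield cluster


@pytest.fixture
def dask_client(cluster):
    # Creating and closing a Client per test is cheap; only the cluster is reused.
    with Client(cluster) as client:
        yield client

Each test still gets its own fresh Client, but the expensive cluster startup and teardown happens only once.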
It may be worth making gen_cluster a proper pytest fixture or mark so that we play more nicely with parametrize and others. I’m not really familiar with how that’s done, though.
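For context, gen_cluster is currently used as a decorator from distributed.utils_test, roughly as in the sketch below (the test body is an illustrative assumption):

from distributed.utils_test import gen_cluster


@gen_cluster(client=True)
async def test_submit(c, s, a, b):
    # The decorator starts a scheduler (s), two workers (a, b), and a
    # client (c) in-process, and tears them all down after the test.
    future = c.submit(lambda x: x + 1, 10)
    assert await future == 11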