`WorkerProcess` leaks environment variables to parent process
Since https://github.com/dask/distributed/pull/6681, `WorkerProcess` leaks the environment specified via the `env` kwarg into the parent process, for example the `CUDA_VISIBLE_DEVICES` variable we use in Dask-CUDA.
Before https://github.com/dask/distributed/pull/6681:

```
In [1]: import os

In [2]: from dask_cuda import LocalCUDACluster

In [3]: os.environ.get("CUDA_VISIBLE_DEVICES")

In [4]: cluster = LocalCUDACluster()
/datasets/pentschev/src/distributed/distributed/node.py:179: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 43355 instead
  warnings.warn(
2022-07-20 11:37:39,518 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:39,518 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:39,519 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:39,519 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:39,525 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:39,526 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:39,542 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:39,542 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:39,548 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:39,548 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:39,548 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:39,549 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:39,551 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:39,551 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:39,551 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:39,552 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize

In [5]: os.environ.get("CUDA_VISIBLE_DEVICES")

In [6]:
```
After https://github.com/dask/distributed/pull/6681:

```
In [1]: import os

In [2]: from dask_cuda import LocalCUDACluster

In [3]: os.environ.get("CUDA_VISIBLE_DEVICES")

In [4]: cluster = LocalCUDACluster()
/datasets/pentschev/src/distributed/distributed/node.py:179: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 39759 instead
  warnings.warn(
2022-07-20 11:37:00,532 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:00,533 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:00,535 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:00,536 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:00,607 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:00,607 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:00,661 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:00,662 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:00,662 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:00,662 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:00,663 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:00,664 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:00,666 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:00,666 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 11:37:00,742 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 11:37:00,742 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize

In [5]: os.environ.get("CUDA_VISIBLE_DEVICES")
Out[5]: '7,0,1,2,3,4,5,6'

In [6]:
```
What happens now is that `os.environ.update(self.env)` is called from the parent process and never reverted. One of the issues this causes is leaking environment variables between pytest tests. Furthermore, if multiple workers are created they may overwrite each other’s variables (I’m not sure whether a cluster can create `WorkerProcess`es with different environment variables, so this may be a non-issue).
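
To make the mechanism concrete, here is a minimal, self-contained sketch of that pattern (the `spawn_worker` and `worker_main` names are illustrative, not distributed’s actual API): the parent mutates its own environment so the spawned child inherits it, and nothing restores the parent’s previous state.

```python
import multiprocessing
import os


def worker_main():
    # The child sees the variable, as intended.
    print("child:", os.environ.get("CUDA_VISIBLE_DEVICES"))


def spawn_worker(env):
    """Illustrative stand-in for the current behavior: mutate the parent's
    environment so the child inherits it, then spawn the child."""
    os.environ.update(env)  # mutates the *parent* process...
    proc = multiprocessing.get_context("spawn").Process(target=worker_main)
    proc.start()
    return proc  # ...and nothing ever reverts the update


if __name__ == "__main__":
    print("parent before:", os.environ.get("CUDA_VISIBLE_DEVICES"))  # None
    p = spawn_worker({"CUDA_VISIBLE_DEVICES": "7,0,1,2,3,4,5,6"})
    p.join()
    # The parent is now polluted too -- this is the leak.
    print("parent after:", os.environ.get("CUDA_VISIBLE_DEVICES"))
```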
This problem has been discussed at length in the past in https://github.com/dask/distributed/issues/3682; it is a difficult problem to tackle from Python, given that any newly spawned process must inherit its environment variables from the parent process. One of the suggestions in https://github.com/dask/distributed/issues/3682#issuecomment-612078761 was to create a lock ensuring multiple workers don’t spawn simultaneously, which would likely increase spawn time a bit but seems to be the only safe option in that situation.
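
For concreteness, a rough sketch of what that lock-based approach could look like, using a hypothetical `spawn_with_env` helper rather than anything that exists in distributed today: take a process-wide lock, apply the overrides, spawn the child, and restore the previous values before releasing the lock.

```python
import multiprocessing
import os
import threading

# One lock per parent process, so concurrent worker spawns cannot observe
# (or clobber) each other's temporary environment overrides.
_spawn_lock = threading.Lock()


def spawn_with_env(env, target):
    """Hypothetical helper: spawn `target` in a child that inherits `env`,
    without leaking `env` into the parent beyond the spawn itself."""
    with _spawn_lock:
        saved = {key: os.environ.get(key) for key in env}
        os.environ.update(env)
        try:
            # The child captures the parent's environment here, while the
            # overrides are in place.
            proc = multiprocessing.get_context("spawn").Process(target=target)
            proc.start()
            return proc
        finally:
            # Restore the parent's environment before the next spawn runs.
            for key, old in saved.items():
                if old is None:
                    os.environ.pop(key, None)
                else:
                    os.environ[key] = old
```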
Any thoughts here @crusaderky (original author of #6681)?
cc’ing @quasiben @kkraus14 @mrocklin for visibility as well, as they were active on the https://github.com/dask/distributed/issues/3682 discussion.
Top GitHub Comments
To be clear, I’m definitely not suggesting reverting and leaving it reverted. I was only suggesting reverting for today to make the release, then in the next week or two adding a different solution we’re all happy with (like @crusaderky’s proposal). It feels like a safer path to me, since we know it won’t break things for other users using env vars in similar ways, even though it would delay getting `MALLOC_TRIM_THRESHOLD_` into the hands of users even longer, which I’d be sad about.

I think I was now able to work around that in https://github.com/rapidsai/dask-cuda/pull/955. I’ll just wait for confirmation until tomorrow morning, but unless some other problem emerges regarding that, we should be fine with the release going out as is.
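
For illustration, a downstream workaround could take roughly the following shape; this is a sketch under my own assumptions, not necessarily what the dask-cuda PR does: snapshot the affected variables before creating the cluster and restore them once startup finishes.

```python
import os
from contextlib import contextmanager


@contextmanager
def restore_environ(keys):
    """Snapshot the given environment variables and restore them on exit,
    deleting any that did not exist beforehand."""
    saved = {key: os.environ.get(key) for key in keys}
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old


# Hypothetical usage, shielding the parent from the leak during startup:
#
#     with restore_environ(["CUDA_VISIBLE_DEVICES"]):
#         cluster = LocalCUDACluster()
```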