Environments and multi-task persistent state
I frequently use GPU clusters, and for these workloads I need to do some relatively expensive setup and teardown operations to establish a valid GPU computing context on each worker before I start the distributed processing of my data. There are various ways to “trick” distributed workers into maintaining a stateful execution environment across multiple jobs, probably the best of which is to use a global variable in the Python interpreter itself. However, this solution (like the others I’m aware of) does not provide particularly fine-grained control over the distributed execution environment. It would be very useful if distributed could provide some way to explicitly manage execution “environments” in which jobs could be run.
I’m imagining something along these lines. I would call
my_gpu_env = executor.environment(my_setup_function, my_teardown_function)
which would cause each worker to run my_setup_function() and then return a reference to its copy of the new environment. The scheduler would record these and send back a single reference to the collection of environments on the workers, my_gpu_env in this example. Subsequent calls to run distributed functions could specify the environment in which they should run:
executor.map(some_function_using_the_gpu, data, environment=my_gpu_env)
The scheduler could keep track of the setup() and teardown() functions associated with each environment. Then, if a new worker comes online and is asked to run a function in an environment that it has not yet set up, it could request the necessary initialization routine from the scheduler and run that first before running any jobs.
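The bookkeeping described above can be sketched with toy classes in plain Python. This is only an illustration of the proposal, not an existing distributed feature: `ToyScheduler`, `ToyWorker`, and `Environment` are hypothetical, and passing the setup result to the task as its first argument is an assumption made here for simplicity.

```python
class Environment:
    """Handle returned to the user; identifies one setup/teardown pair."""

    def __init__(self, env_id):
        self.env_id = env_id


class ToyScheduler:
    """Hypothetical scheduler that records setup/teardown per environment."""

    def __init__(self):
        self._registry = {}  # env_id -> (setup, teardown)
        self._next_id = 0

    def environment(self, setup, teardown):
        env = Environment(self._next_id)
        self._registry[env.env_id] = (setup, teardown)
        self._next_id += 1
        return env

    def lookup(self, env_id):
        return self._registry[env_id]


class ToyWorker:
    """Hypothetical worker that initializes an environment on first use."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self._state = {}  # env_id -> object returned by setup()

    def run(self, func, arg, environment):
        # A worker that has not seen this environment asks the scheduler
        # for the setup routine and runs it before the job itself.
        if environment.env_id not in self._state:
            setup, _ = self.scheduler.lookup(environment.env_id)
            self._state[environment.env_id] = setup()
        return func(self._state[environment.env_id], arg)

    def close(self):
        # Run the matching teardown for every environment this worker set up.
        for env_id, state in self._state.items():
            _, teardown = self.scheduler.lookup(env_id)
            teardown(state)
        self._state.clear()


events = []


def my_setup_function():
    events.append("setup")
    return {"scale": 10}  # stand-in for a GPU context


def my_teardown_function(state):
    events.append("teardown")


scheduler = ToyScheduler()
my_gpu_env = scheduler.environment(my_setup_function, my_teardown_function)
worker = ToyWorker(scheduler)  # a worker that came online with no state
outputs = [worker.run(lambda state, x: state["scale"] * x, x, my_gpu_env)
           for x in (1, 2)]
worker.close()
```

In this sketch the setup runs once per worker per environment, no matter how many jobs use it, and teardown runs when the worker shuts down. A real implementation would also have to decide how to serialize the functions and what to do when setup fails on a subset of workers.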
This is a somewhat rough sketch of what would be desirable, and I’d like to start a discussion to see whether other users might also want a feature like this. In particular, are there others using distributed to manage a cluster of GPU nodes? How do you manage a cluster-wide execution context?
Issue Analytics
- Created 8 years ago
- Comments: 28 (19 by maintainers)
Top GitHub Comments
If you want to use graphs explicitly, you would need to modify your graph to point to the "data" key rather than to the object itself.
Before
After
But really I would just do the following:
@thompson42 - how did you solve this for your use case?