[distributed] Improper shutdown of remote service
I’m still investigating a standard error that occurs on most runs with SageMaker:
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 20, in <module>
    subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
  File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['train']' died with <Signals.SIGTERM: 15>.
After some investigation it looks like:
- the SIGTERM is sent from Service.shutdown. This is not good.
The clean termination procedure is the atexit call that calls _close_global_remote_services, which does not call Service.shutdown; this is OK. However, it is only invoked upon clean termination, not upon a SIGTERM, as stated in the docs: https://docs.python.org/3.6/library/atexit.html#module-atexit
So this os.killpg does not seem to be the right tool for the job.
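The atexit limitation mentioned above can be checked in isolation. The following is a minimal sketch, not Dask code: the child script, the marker file and the function name atexit_ran_after_sigterm are made up for illustration, and a POSIX system is assumed. A child registers an atexit handler, the parent sends it SIGTERM, and the function reports whether the handler left its marker file behind.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical child script: registers an atexit handler that creates a
# marker file, announces readiness, then sleeps until it is signalled.
CHILD = textwrap.dedent(
    """
    import atexit, sys, time
    marker = sys.argv[1]
    atexit.register(lambda: open(marker, "w").close())
    print("ready", flush=True)
    time.sleep(30)
    """
)


def atexit_ran_after_sigterm():
    """Return (returncode, marker_exists) for a child stopped with SIGTERM."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "atexit_ran")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker],
            stdout=subprocess.PIPE,
        )
        # Wait until the child has registered its handler.
        proc.stdout.readline()
        proc.send_signal(signal.SIGTERM)
        proc.wait()
        proc.stdout.close()
        return proc.returncode, os.path.exists(marker)


if __name__ == "__main__":
    print(atexit_ran_after_sigterm())
```

On POSIX the return code is the negated signal number and the marker file is absent, because the child does not install a Python-level SIGTERM handler and so never reaches its atexit callbacks.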
Possible workaround
I think the process should instead be killed using proc.kill (https://docs.python.org/3/library/subprocess.html#subprocess.Popen.kill).
I’ll try to experiment with this and report back (with a PR if things go well), but it’d be great if someone more familiar with Dask could think about this and get it fixed reasonably soon.
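To illustrate the difference between signalling a process group and signalling a single child, here is a hedged sketch using only the standard library. It is not the Dask implementation: spawn_sleeper, shares_group_with_parent and stop_child_only are hypothetical helpers, and a POSIX system is assumed. A child started without start_new_session stays in the parent's process group, so os.killpg on that group would also signal the parent, whereas Popen.terminate and Popen.kill only target the child. Whether this is exactly what happens inside Service.shutdown is an assumption based on the description above.

```python
import os
import signal
import subprocess
import sys


def spawn_sleeper(new_session=False):
    """Start a child that just sleeps; optionally give it its own session."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        start_new_session=new_session,
    )


def shares_group_with_parent(proc):
    """True if os.killpg on the child's group would also hit this process."""
    return os.getpgid(proc.pid) == os.getpgid(0)


def stop_child_only(proc, timeout=5.0):
    """Stop just the child: SIGTERM first, SIGKILL if it does not exit."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return proc.returncode


if __name__ == "__main__":
    child = spawn_sleeper()
    print("shares group with parent:", shares_group_with_parent(child))
    print("child return code:", stop_child_only(child))
```

If a group-wide signal is really wanted, starting the child with start_new_session=True gives it a separate process group, so the parent is no longer in the blast radius.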
I don’t think that’s doable publicly at this point. But I’m willing to communicate internally with someone who would want to dig into this + experiment with possible ways out of the issue & report.
Closing this, please re-open if this issue still exists.