[distributed] Improper shutdown of remote service

cc: @zhanghang1989 @mseeger

I’m still investigating a standard error that occurs on most runs with SageMaker:

Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 20, in <module>
    subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
  File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['train']' died with <Signals.SIGTERM: 15>.

After some investigation, it looks like the SIGTERM is sent from Service.shutdown, as per

https://github.com/awslabs/autogluon/blob/04866e5a976c471df1c9272b9be15132f986ead9/autogluon/scheduler/remote/remote.py#L52-L53

This is not good: as the traceback shows, the signal takes down the train command itself.

The clean termination procedure is the atexit call

https://github.com/awslabs/autogluon/blob/04866e5a976c471df1c9272b9be15132f986ead9/autogluon/scheduler/remote/remote.py#L165

That call runs _close_global_remote_services, which does not call Service.shutdown; that part is OK. However, atexit handlers only run on clean interpreter termination, not when the process is killed by an unhandled signal such as SIGTERM, as stated in the docs: https://docs.python.org/3.6/library/atexit.html#module-atexit
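To make the atexit behaviour concrete, here is a minimal, POSIX-only sketch that is independent of autogluon. It starts a child Python process that registers an atexit handler, sends it SIGTERM, and reports whether the handler ran. The names `CHILD` and `run_child` are hypothetical and exist only for this illustration; the optional SIGTERM handler is one common way to get atexit handlers to run, not necessarily the fix adopted in the project.

```python
import signal
import subprocess
import sys
import textwrap

# Source of a throwaway child process (hypothetical, for illustration only).
CHILD = textwrap.dedent("""
    import atexit, signal, sys, time
    atexit.register(lambda: print("cleanup ran", flush=True))
    if len(sys.argv) > 1 and sys.argv[1] == "handle":
        # Turn SIGTERM into a normal interpreter exit so atexit handlers run.
        signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
    print("ready", flush=True)
    time.sleep(30)
""")


def run_child(handle_sigterm):
    """Start the child, send it SIGTERM, return (returncode, remaining stdout)."""
    args = [sys.executable, "-c", CHILD]
    if handle_sigterm:
        args.append("handle")
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, universal_newlines=True)
    # Wait until the child has registered its handlers before signalling it.
    assert proc.stdout.readline().strip() == "ready"
    proc.send_signal(signal.SIGTERM)
    rest, _ = proc.communicate(timeout=10)
    return proc.returncode, rest.strip()


if __name__ == "__main__":
    print(run_child(handle_sigterm=False))
    print(run_child(handle_sigterm=True))
```

Without the handler, the child dies from the signal and the atexit function never runs. With the handler, SIGTERM becomes a regular `sys.exit`, so the cleanup is executed.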

So this os.killpg does not seem to be the right tool for the job.

Possible workaround

I think here the process should be killed using proc.kill (https://docs.python.org/3/library/subprocess.html#subprocess.Popen.kill)
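As a rough sketch of the difference (POSIX-only, not the autogluon code): a child started with `subprocess.Popen` and default arguments inherits the caller's process group, so `os.killpg(os.getpgid(proc.pid), signal.SIGTERM)` would also signal the caller. Whether that is exactly what happens in remote.py is an assumption, since the linked lines are not reproduced here. The helper names below are hypothetical. Note that `Popen.kill` sends SIGKILL on POSIX, which gives the child no chance to clean up, so this sketch tries `Popen.terminate` first and only falls back to `kill`.

```python
import os
import subprocess
import sys


def spawn_sleeper(new_session=False):
    """Start a throwaway child process; optionally give it its own session."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        start_new_session=new_session,
    )


def shares_group_with_parent(proc):
    """True if a killpg on the child's process group would also hit this process."""
    return os.getpgid(proc.pid) == os.getpgid(0)


def stop_child_only(proc, timeout=5.0):
    """Signal just this child: SIGTERM first, SIGKILL if it does not exit in time."""
    proc.terminate()  # SIGTERM to this one pid, not to a process group
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL, cannot be caught by the child
        return proc.wait()


if __name__ == "__main__":
    child = spawn_sleeper()
    print("same process group as parent:", shares_group_with_parent(child))
    print("child return code:", stop_child_only(child))
```

If group-wide termination is actually wanted, starting the child with `start_new_session=True` puts it in its own process group, so a later `os.killpg` on that group no longer includes the caller.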

I’ll experiment with this and report back (with a PR if things go well), but it would be great if someone more familiar with Dask could think about this so it gets fixed reasonably soon.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
tlienart commented, Apr 23, 2020

I don’t think that’s doable publicly at this point. But I’m willing to communicate internally with someone who would want to dig into this + experiment with possible ways out of the issue & report.

0 reactions
Innixma commented, Feb 14, 2021

Closing this, please re-open if this issue still exists.
