[BUG] Minion did not return. [No response] with some random minions
Description On masters that have been running for a few weeks, salt commands fail with “Minion did not return. [No response]” on seemingly random minions. The minions that fail to respond change each time the salt command is rerun. I can reproduce the issue when targeting a single minion with a test.ping command. Restarting the master service resolves the issue, but it reappears once the service has been running again for some days or weeks.
Setup One master with a few registered minions.
Steps to Reproduce the behavior The issue is not easy to reproduce: the master service must have been running for several days or weeks, and even then it appears randomly. I reproduced it with a single minion and a test.ping command.
- The salt command line creates the test.ping job
- The minion executes the job and sends the return to the master
- The salt command line does not receive the response and creates a find_job job
- The minion answers the find_job with an empty response because the initial job has already finished
- The salt command ends with “Minion did not return. [No response]”
- At the same time, the master event bus does not show the response to the test.ping command
- A jobs.lookup_jid correctly retrieves the response sent by the minion
Some responses seem to be dropped by the master event bus.
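The reasoning behind that conclusion can be written down as a small decision function. This is only a sketch for illustration: `diagnose` and its three boolean inputs are hypothetical names, not part of Salt, and they stand for what the salt command line, saltutil.find_job and jobs.lookup_jid each report for one minion.

```python
def diagnose(cli_got_return, find_job_reports_running, job_cache_has_return):
    """Classify one minion's outcome from three observations.

    cli_got_return: the salt command line printed the minion's return
    find_job_reports_running: saltutil.find_job returned a non-empty result
    job_cache_has_return: jobs.lookup_jid shows a return for the minion
    """
    if cli_got_return:
        return "ok"
    if find_job_reports_running:
        return "job still running on the minion"
    if job_cache_has_return:
        # The master stored the return, so the minion did answer; the
        # return was lost somewhere between the master and the CLI.
        return "return reached the master but not the CLI"
    return "no return recorded on the master"


# The case from this report: no CLI return, empty find_job, lookup_jid finds it.
print(diagnose(False, False, True))
```

The third branch is the one that matches this report, which is why the event bus is the suspect rather than the minion or the network path to the master.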
master logs
# salt -l debug myserver test.ping
[DEBUG ] Reading configuration from /etc/salt/master
[DEBUG ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG ] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/master', u'myserver.dsone.3ds.com_master', u'tcp://127.0.0.1:4506', u'clear')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
[DEBUG ] Trying to connect to: tcp://127.0.0.1:4506
[DEBUG ] Closing AsyncZeroMQReqChannel instance
[DEBUG ] LazyLoaded local_cache.get_load
[DEBUG ] Reading minion list from /var/cache/salt/master/jobs/20/1aa1dda811f2bdb606bb78bba4ff9f3da4f8ad23a7da18d121a25ee34fb5b7/.minions.p
[DEBUG ] get_iter_returns for jid 20200618175516640328 sent to set(['myserver']) will timeout at 17:55:21.656635
[DEBUG ] Checking whether jid 20200618175516640328 is still running
[DEBUG ] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/master', u'myserver.dsone.3ds.com_master', u'tcp://127.0.0.1:4506', u'clear')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
[DEBUG ] Trying to connect to: tcp://127.0.0.1:4506
[DEBUG ] Closing AsyncZeroMQReqChannel instance
[DEBUG ] Passing on saltutil error. Key 'u'retcode' missing from client return. This may be an error in the client.
[DEBUG ] return event: {'myserver': {u'failed': True}}
myserver:
Minion did not return. [No response]
[DEBUG ] Closing IPCMessageSubscriber instance
ERROR: Minions returned with non-zero exit code
master event bus
# salt-run state.event pretty=True
20200618175516640328 {
"_stamp": "2020-06-18T15:55:16.640731",
"minions": [
"myserver"
]
}
salt/job/20200618175516640328/new {
"_stamp": "2020-06-18T15:55:16.641871",
"arg": [],
"fun": "test.ping",
"jid": "20200618175516640328",
"minions": [
"myserver"
],
"missing": [],
"tgt": "myserver",
"tgt_type": "glob",
"user": "root"
}
20200618175521762170 {
"_stamp": "2020-06-18T15:55:21.762531",
"minions": [
"myserver"
]
}
salt/job/20200618175521762170/new {
"_stamp": "2020-06-18T15:55:21.763894",
"arg": [
"20200618175516640328"
],
"fun": "saltutil.find_job",
"jid": "20200618175521762170",
"minions": [
"myserver"
],
"missing": [],
"tgt": [
"myserver"
],
"tgt_type": "list",
"user": "root"
}
salt/job/20200618175521762170/ret/myserver {
"_stamp": "2020-06-18T15:55:21.861260",
"cmd": "_return",
"fun": "saltutil.find_job",
"fun_args": [
"20200618175516640328"
],
"id": "myserver",
"jid": "20200618175521762170",
"master_id": "myserver",
"retcode": 0,
"return": {},
"success": true
}
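To spot this pattern across many jobs, the event dump can be checked mechanically: every salt/job/<jid>/new event lists the targeted minions, and each of them should later show up in a salt/job/<jid>/ret/<minion> event. The sketch below assumes the events have already been collected as (tag, data) pairs; `find_missing_returns` is a hypothetical helper, and the sample list is hand-copied and trimmed from the dump above.

```python
import re
from collections import defaultdict

# Tag patterns as they appear in the dump above.
NEW_TAG = re.compile(r"^salt/job/(?P<jid>\d+)/new$")
RET_TAG = re.compile(r"^salt/job/(?P<jid>\d+)/ret/(?P<minion>.+)$")


def find_missing_returns(events):
    """Return {jid: [minions with a 'new' event but no 'ret' event]}."""
    expected = {}
    returned = defaultdict(set)
    for tag, data in events:
        new = NEW_TAG.match(tag)
        if new:
            expected[new.group("jid")] = set(data.get("minions", []))
            continue
        ret = RET_TAG.match(tag)
        if ret:
            returned[ret.group("jid")].add(ret.group("minion"))
    missing = {}
    for jid, minions in expected.items():
        absent = sorted(minions - returned[jid])
        if absent:
            missing[jid] = absent
    return missing


events = [
    ("salt/job/20200618175516640328/new",
     {"fun": "test.ping", "minions": ["myserver"]}),
    ("salt/job/20200618175521762170/new",
     {"fun": "saltutil.find_job", "minions": ["myserver"]}),
    ("salt/job/20200618175521762170/ret/myserver",
     {"fun": "saltutil.find_job", "return": {}}),
]
print(find_missing_returns(events))
```

With the events above, only the test.ping jid is reported, because the find_job job did get its ret event on the bus.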
job lookup
[root@myserver ~]# salt-run -l info jobs.lookup_jid 20200618175516640328
myserver:
True
[INFO ] Runner completed: 20200619091453151012
minion logs
2020-06-18 17:55:16,652 [salt.minion :1482][INFO ][21477] User root Executing command test.ping with jid 20200618175516640328
2020-06-18 17:55:16,653 [salt.minion :1489][DEBUG ][21477] Command details {u'tgt_type': u'glob', u'jid': u'20200618175516640328', u'tgt': u'myserver', u'ret': u'', u'user': u'root', u'arg': [], u'fun': u'test.ping', u'master_id': u'myserver'}
2020-06-18 17:55:16,657 [salt.utils.process:860 ][DEBUG ][21477] Subprocess SignalHandlingMultiprocessingProcess-1:8-Job-20200618175516640328 added
2020-06-18 17:55:16,716 [salt.utils.lazy :104 ][DEBUG ][22892] LazyLoaded jinja.render
2020-06-18 17:55:16,719 [salt.utils.lazy :104 ][DEBUG ][22892] LazyLoaded yaml.render
2020-06-18 17:55:16,721 [salt.minion :1609][INFO ][22892] Starting a new job 20200618175516640328 with PID 22892
2020-06-18 17:55:16,724 [salt.utils.lazy :107 ][DEBUG ][22892] Could not LazyLoad {0}.allow_missing_func: '{0}.allow_missing_func' is not available.
2020-06-18 17:55:16,742 [salt.utils.lazy :104 ][DEBUG ][22892] LazyLoaded test.ping
2020-06-18 17:55:16,743 [salt.loaded.int.module.test:124 ][DEBUG ][22892] test.ping received for minion 'myserver'
2020-06-18 17:55:16,743 [salt.minion :807 ][DEBUG ][22892] Minion return retry timer set to 10 seconds (randomized)
2020-06-18 17:55:16,744 [salt.minion :1937][INFO ][22892] Returning information for job: 20200618175516640328
2020-06-18 17:55:16,745 [salt.transport.zeromq:138 ][DEBUG ][22892] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/minion', u'myserver', u'tcp://10.81.105.213:4506', u'aes')
2020-06-18 17:55:16,746 [salt.crypt :464 ][DEBUG ][22892] Initializing new AsyncAuth for (u'/etc/salt/pki/minion', u'myserver', u'tcp://10.81.105.213:4506')
2020-06-18 17:55:16,747 [salt.transport.zeromq:209 ][DEBUG ][22892] Connecting the Minion to the Master URI (for the return server): tcp://10.81.105.213:4506
2020-06-18 17:55:16,748 [salt.transport.zeromq:1189][DEBUG ][22892] Trying to connect to: tcp://10.81.105.213:4506
2020-06-18 17:55:16,756 [salt.transport.zeromq:233 ][DEBUG ][22892] Closing AsyncZeroMQReqChannel instance
2020-06-18 17:55:16,758 [salt.minion :1787][DEBUG ][22892] minion return: {u'fun_args': [], u'jid': u'20200618175516640328', u'return': True, u'retcode': 0, u'success': True, u'fun': u'test.ping', u'master_id': u'myserver'}
2020-06-18 17:55:17,717 [salt.utils.process:869 ][DEBUG ][21477] Subprocess SignalHandlingMultiprocessingProcess-1:8-Job-20200618175516640328 cleaned up
2020-06-18 17:55:21,775 [salt.minion :1482][INFO ][21477] User root Executing command saltutil.find_job with jid 20200618175521762170
2020-06-18 17:55:21,776 [salt.minion :1489][DEBUG ][21477] Command details {u'tgt_type': u'list', u'jid': u'20200618175521762170', u'tgt': [u'myserver'], u'ret': u'', u'user': u'root', u'arg': [u'20200618175516640328'], u'fun': u'saltutil.find_job', u'master_id': u'myserver'}
2020-06-18 17:55:21,779 [salt.utils.process:860 ][DEBUG ][21477] Subprocess SignalHandlingMultiprocessingProcess-1:9-Job-20200618175521762170 added
2020-06-18 17:55:21,838 [salt.utils.lazy :104 ][DEBUG ][22904] LazyLoaded jinja.render
2020-06-18 17:55:21,841 [salt.utils.lazy :104 ][DEBUG ][22904] LazyLoaded yaml.render
2020-06-18 17:55:21,844 [salt.minion :1609][INFO ][22904] Starting a new job 20200618175521762170 with PID 22904
2020-06-18 17:55:21,847 [salt.utils.lazy :107 ][DEBUG ][22904] Could not LazyLoad {0}.allow_missing_func: '{0}.allow_missing_func' is not available.
2020-06-18 17:55:21,850 [salt.utils.lazy :104 ][DEBUG ][22904] LazyLoaded saltutil.find_job
2020-06-18 17:55:21,852 [salt.minion :807 ][DEBUG ][22904] Minion return retry timer set to 6 seconds (randomized)
2020-06-18 17:55:21,852 [salt.minion :1937][INFO ][22904] Returning information for job: 20200618175521762170
2020-06-18 17:55:21,853 [salt.transport.zeromq:138 ][DEBUG ][22904] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/minion', u'myserver', u'tcp://10.81.105.213:4506', u'aes')
2020-06-18 17:55:21,854 [salt.crypt :464 ][DEBUG ][22904] Initializing new AsyncAuth for (u'/etc/salt/pki/minion', u'myserver', u'tcp://10.81.105.213:4506')
2020-06-18 17:55:21,856 [salt.transport.zeromq:209 ][DEBUG ][22904] Connecting the Minion to the Master URI (for the return server): tcp://10.81.105.213:4506
2020-06-18 17:55:21,857 [salt.transport.zeromq:1189][DEBUG ][22904] Trying to connect to: tcp://10.81.105.213:4506
2020-06-18 17:55:21,865 [salt.transport.zeromq:233 ][DEBUG ][22904] Closing AsyncZeroMQReqChannel instance
2020-06-18 17:55:21,866 [salt.minion :1787][DEBUG ][22904] minion return: {u'fun_args': [u'20200618175516640328'], u'jid': u'20200618175521762170', u'return': {}, u'retcode': 0, u'success': True, u'fun': u'saltutil.find_job', u'master_id': u'myserver'}
2020-06-18 17:55:22,717 [salt.utils.process:869 ][DEBUG ][21477] Subprocess SignalHandlingMultiprocessingProcess-1:9-Job-20200618175521762170 cleaned up
Expected behavior Responses sent by minions should be reported by the salt command line.
Versions Report
salt --versions-report
Salt Version:
Salt: 2019.2.4
Dependency Versions:
cffi: 1.6.0
cherrypy: Not Installed
dateutil: 1.5
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: 0.26.3
libnacl: Not Installed
M2Crypto: 0.21.1
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.6.2
mysql-python: Not Installed
pycparser: 2.19
pycrypto: 2.6.1
pycryptodome: 3.9.7
pygit2: 0.26.4
Python: 2.7.5 (default, Jun 11 2019, 14:33:56)
python-gnupg: 0.4.4
PyYAML: 3.10
PyZMQ: 15.3.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.1.4
System Versions:
dist: redhat 7.5 Maipo
locale: UTF-8
machine: x86_64
release: 3.10.0-1127.8.2.el7.x86_64
system: Linux
version: Red Hat Enterprise Linux Server 7.5 Maipo
I can confirm that we also see this on 3002.2. Hopefully the next release fixes it.
We are seeing the same thing with our Windows minions in somewhat higher-latency environments. We have minions in Australia and Europe that have connection issues with the master in us-east-1.
Windows minions whose latency stays below 50 ms do not need the tuning.
The Linux minions do not seem to suffer from the same issue and stay connected.
Aggressively tuning the tcp_keepalive settings on the Windows minions seems to stabilize them.
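For anyone who wants to try the same thing, the relevant minion options are tcp_keepalive, tcp_keepalive_idle, tcp_keepalive_cnt and tcp_keepalive_intvl. The comment above does not say which values were used, so the numbers in this sketch are placeholders, not a recommendation. `render_minion_dropin` is a hypothetical helper that only prints the config text; where to save it (for example a file under the minion's minion.d directory) is left to the reader.

```python
# Illustrative values only; the original comment does not give the
# numbers that were used, so tune these for your own network.
KEEPALIVE_SETTINGS = {
    "tcp_keepalive": True,       # enable TCP keepalive probes
    "tcp_keepalive_idle": 60,    # idle seconds before the first probe
    "tcp_keepalive_cnt": 3,      # failed probes before the link is dropped
    "tcp_keepalive_intvl": 10,   # seconds between probes
}


def render_minion_dropin(settings):
    """Render the settings as 'key: value' lines for a minion config file."""
    lines = ["# keepalive tuning (illustrative values)"]
    for key, value in settings.items():
        lines.append("{}: {}".format(key, value))
    return "\n".join(lines) + "\n"


print(render_minion_dropin(KEEPALIVE_SETTINGS))
```

The minion has to be restarted to pick up the change, and it is worth verifying on the target platform that each option is actually honored.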