[BUG] Cannot allocate memory when running test.ping
Description
I upgraded from 2017.something to 2019.2.5 a little while back when there was that major security issue, and today I encountered this:
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: Process Maintenance-5:
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: Traceback (most recent call last):
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: self.run()
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/dist-packages/salt/master.py", line 234, in run
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: salt.daemons.masterapi.clean_old_jobs(self.opts)
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/dist-packages/salt/daemons/masterapi.py", line 169, in clean_old_jobs
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: rend=False,
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 887, in __init__
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: role='master'
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/dist-packages/salt/config/__init__.py", line 2464, in minion_config
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: minion_id=minion_id)
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/dist-packages/salt/config/__init__.py", line 3802, in apply_minion_config
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: cache_minion_id=cache_minion_id)
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/dist-packages/salt/config/__init__.py", line 3684, in get_id
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: newid = salt.utils.network.generate_minion_id()
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/dist-packages/salt/utils/network.py", line 181, in generate_minion_id
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: ret = salt.utils.stringutils.to_unicode(_generate_minion_id().first())
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/dist-packages/salt/utils/network.py", line 170, in _generate_minion_id
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: return hosts.extend([addr for addr in ip_addrs()
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/dist-packages/salt/utils/network.py", line 1288, in ip_addrs
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: return _ip_addrs(interface, include_loopback, interface_data, 'inet')
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/dist-packages/salt/utils/network.py", line 1263, in _ip_addrs
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: else interfaces()
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/dist-packages/salt/utils/network.py", line 1056, in interfaces
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: return linux_interfaces()
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/dist-packages/salt/utils/network.py", line 853, in linux_interfaces
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: stderr=subprocess.STDOUT).communicate()[0]
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/subprocess.py", line 390, in __init__
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: errread, errwrite)
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: File "/usr/lib/python2.7/subprocess.py", line 916, in _execute_child
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: self.pid = os.fork()
Jun 16 16:37:46 ip-172-31-16-10 salt-master[13325]: OSError: [Errno 12] Cannot allocate memory
Jun 16 16:37:54 ip-172-31-16-10 salt-master[13325]: [INFO ] Process <class 'salt.master.Maintenance'> (13342) died with exit status 1, restarting...
This happened when I ran sudo salt '*' test.ping on the master. All 13 minions on the network returned True, but the command just got stuck there, never returning me to the shell.
Then I noticed the master host started swapping (I had another SSH shell open and it started running slow so I checked). Then a moment later, the entire box froze (for a moment I thought it had crashed) as the salt-master must have quickly exhausted all the swap. Then the salt-master process appears to have crashed and restarted (thanks to systemd), which brought the box back to life again.
I’ve never seen this happen before today. Thankfully it was in a small staging environment I run, using Debian’s official AMIs. I was just in the process of rotating out the previous us-west-2 AMI (ami-0d270a69ac13b22c3) with the new one (ami-0c90f7501bcd55772) so some minions were running on slightly different AMIs, although I have no reason to suspect this problem is anything related to the AMI itself. I’m just mentioning it for completeness.
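The traceback above ends with os.fork() raising errno 12 inside subprocess, which means the kernel refused to create the child process rather than Python's own allocator running out. As a minimal sketch of how that failure surfaces to a caller and can be told apart from other spawn errors (the helper names explain_spawn_error and run are hypothetical and not part of Salt):

```python
import errno
import subprocess


def explain_spawn_error(exc):
    """Return a short diagnosis for an OSError raised while spawning a child."""
    if exc.errno == errno.ENOMEM:
        # Errno 12: the kernel could not allocate memory for the new process.
        return "spawn failed: cannot allocate memory (ENOMEM)"
    if exc.errno == errno.ENOENT:
        return "spawn failed: command not found (ENOENT)"
    return "spawn failed: errno {0}".format(exc.errno)


def run(cmd):
    """Run cmd and return (output, error); error is None on success."""
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
    except OSError as exc:
        # Raised before the child exists, as in the traceback above.
        return None, explain_spawn_error(exc)
    return proc.communicate()[0], None
```

A caller that gets the ENOMEM diagnosis knows the child never started, so retrying immediately is unlikely to help until memory pressure on the host drops.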
Setup
These were the AWS network interfaces:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group default qlen 1000
link/ether 06:dc:0f:2f:4b:32 brd ff:ff:ff:ff:ff:ff
inet 172.31.16.10/20 brd 172.31.31.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::4dc:fff:fe2f:4b32/64 scope link
valid_lft forever preferred_lft forever
The instance was configured with 2 GB of RAM and 2 GB of swap. As stated above, there were only 13 minions. Obviously we don’t want to pay for bigger EC2 instances for a lightly used staging environment if it’s not necessary, although I suspect that doing so would not have helped.
Here’s the ps auxf output of the box under normal circumstances:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2 0.0 0.0 0 0 ? S Jun02 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S Jun02 0:03 \_ [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [kworker/0:0H]
root 7 0.0 0.0 0 0 ? S Jun02 0:51 \_ [rcu_sched]
root 8 0.0 0.0 0 0 ? S Jun02 0:00 \_ [rcu_bh]
root 9 0.0 0.0 0 0 ? S Jun02 0:12 \_ [migration/0]
root 10 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [lru-add-drain]
root 11 0.0 0.0 0 0 ? S Jun02 0:01 \_ [watchdog/0]
root 12 0.0 0.0 0 0 ? S Jun02 0:00 \_ [cpuhp/0]
root 13 0.0 0.0 0 0 ? S Jun02 0:00 \_ [cpuhp/1]
root 14 0.0 0.0 0 0 ? S Jun02 0:01 \_ [watchdog/1]
root 15 0.0 0.0 0 0 ? S Jun02 0:12 \_ [migration/1]
root 16 0.0 0.0 0 0 ? S Jun02 0:04 \_ [ksoftirqd/1]
root 18 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [kworker/1:0H]
root 19 0.0 0.0 0 0 ? S Jun02 0:00 \_ [kdevtmpfs]
root 20 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [netns]
root 21 0.0 0.0 0 0 ? S Jun02 0:00 \_ [khungtaskd]
root 22 0.0 0.0 0 0 ? S Jun02 0:00 \_ [oom_reaper]
root 23 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [writeback]
root 24 0.0 0.0 0 0 ? S Jun02 0:00 \_ [kcompactd0]
root 26 0.0 0.0 0 0 ? SN Jun02 0:00 \_ [ksmd]
root 27 0.0 0.0 0 0 ? SN Jun02 0:00 \_ [khugepaged]
root 28 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [crypto]
root 29 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [kintegrityd]
root 30 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [bioset]
root 31 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [kblockd]
root 32 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [devfreq_wq]
root 33 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [watchdogd]
root 37 0.0 0.0 0 0 ? S Jun02 6:22 \_ [kswapd0]
root 38 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [vmstat]
root 50 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [kthrotld]
root 52 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [ipv6_addrconf]
root 87 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [ena]
root 89 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [nvme]
root 120 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [bioset]
root 121 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [bioset]
root 143 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [kworker/u5:0]
root 154 0.0 0.0 0 0 ? S Jun02 0:09 \_ [jbd2/nvme0n1p1-]
root 155 0.0 0.0 0 0 ? S< Jun02 0:00 \_ [ext4-rsv-conver]
root 190 0.0 0.0 0 0 ? S Jun02 0:00 \_ [kauditd]
root 214 0.0 0.0 0 0 ? S< Jun02 0:03 \_ [kworker/1:1H]
root 255 0.0 0.0 0 0 ? S< Jun02 0:03 \_ [kworker/0:1H]
root 8615 0.0 0.0 0 0 ? S 17:09 0:00 \_ [kworker/u4:1]
root 12546 0.0 0.0 0 0 ? S 17:10 0:00 \_ [kworker/1:2]
root 12549 0.0 0.0 0 0 ? S 17:10 0:00 \_ [kworker/0:3]
root 21206 0.0 0.0 0 0 ? S 17:22 0:00 \_ [kworker/0:0]
root 21209 0.0 0.0 0 0 ? S 17:22 0:00 \_ [kworker/1:1]
root 22765 0.0 0.0 0 0 ? S 17:24 0:00 \_ [kworker/u4:2]
root 25011 0.0 0.0 0 0 ? S 17:30 0:00 \_ [kworker/1:0]
root 1 0.0 0.1 204528 2992 ? Ss Jun02 0:11 /sbin/init
root 206 0.0 0.0 45396 12 ? Ss Jun02 0:01 /lib/systemd/systemd-udevd
root 306 0.0 0.0 20476 508 ? Ss Jun02 0:00 /sbin/dhclient -4 -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases eth0
root 379 0.0 0.0 20356 212 ? Ss Jun02 0:01 /sbin/dhclient -v -6 -nw -pf /run/dhclient-6.eth0.pid -lf /var/lib/dhcp/dhclient-6.eth0.leases -I -df /var/lib/dhcp/dhclient-6.eth0.leases eth0
root 477 0.0 0.0 29600 504 ? Ss Jun02 0:02 /usr/sbin/cron -f
root 480 0.0 0.0 46496 192 ? Ss Jun02 0:02 /lib/systemd/systemd-logind
message+ 481 0.0 0.0 45128 236 ? Ss Jun02 0:02 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
root 500 0.0 0.0 14300 0 ttyS0 Ss+ Jun02 0:00 /sbin/agetty --keep-baud 115200,38400,9600 ttyS0 vt220
root 501 0.0 0.0 14524 0 tty1 Ss+ Jun02 0:00 /sbin/agetty --noclear tty1 linux
root 12928 0.0 0.0 69956 180 ? Ss Jun02 0:03 /usr/sbin/sshd -D
root 15774 0.0 0.0 95180 20 ? Ss 10:16 0:00 \_ sshd: debian [priv]
debian 15783 0.0 0.0 96708 552 ? S 10:17 0:03 \_ sshd: debian@pts/0,pts/1,pts/2
debian 30436 0.0 0.1 24004 3468 pts/1 Ss 14:01 0:00 \_ -bash
debian 26163 0.0 0.1 40044 3284 pts/1 R+ 17:35 0:00 | \_ ps auxf
debian 26164 0.0 0.0 16460 1036 pts/1 S+ 17:35 0:00 | \_ sed s/debian/debian/g
root 19607 0.0 0.2 59352 4280 ? Ss Jun02 1:10 /lib/systemd/systemd-journald
ntp 21033 0.0 0.0 102108 132 ? Ssl Jun02 0:58 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 109:114
root 10518 0.0 0.0 174348 352 ? Ss Jun03 0:00 /usr/bin/python /usr/bin/salt-minion
root 10546 0.0 2.4 645888 48692 ? Sl Jun03 11:52 \_ /usr/bin/python /usr/bin/salt-minion
root 11306 0.0 0.0 256192 388 ? S Jun03 0:00 \_ /usr/bin/python /usr/bin/salt-minion
debian 15779 0.0 0.0 64872 920 ? Ss 10:17 0:00 /lib/systemd/systemd --user
debian 15780 0.0 0.0 230052 4 ? S 10:17 0:00 \_ (sd-pam)
root 6995 0.0 0.1 53388 3240 ? Ss 14:21 0:01 /usr/bin/perl -wT /usr/sbin/munin-node
root 7115 0.0 0.0 278092 772 ? Ssl 14:21 0:01 /usr/sbin/rsyslogd -n
root 28619 0.0 0.0 81188 148 ? Ss 14:59 0:00 /usr/lib/postfix/sbin/master -w
postfix 28621 0.0 0.0 83424 76 ? S 14:59 0:00 \_ qmgr -l -t fifo -u
postfix 27878 0.0 0.0 87492 1296 ? S 16:25 0:00 \_ tlsmgr -l -t unix -u
postfix 31161 0.0 0.0 83256 636 ? S 16:38 0:00 \_ pickup -l -t fifo -u
root 3008 0.0 0.4 232304 9844 ? Ss 17:07 0:00 /usr/bin/python /usr/bin/salt-master
root 3017 0.0 0.1 165544 3828 ? S 17:08 0:00 \_ /usr/bin/python /usr/bin/salt-master
root 3020 0.0 0.5 313364 10712 ? Sl 17:08 0:00 \_ /usr/bin/python /usr/bin/salt-master
root 3023 0.0 0.9 297640 20144 ? S 17:08 0:01 \_ /usr/bin/python /usr/bin/salt-master
root 3024 0.8 4.0 1141264 82796 ? Sl 17:08 0:14 \_ /usr/bin/python /usr/bin/salt-master
root 3025 0.5 3.1 252132 63364 ? S 17:08 0:09 \_ /usr/bin/python /usr/bin/salt-master
root 3026 0.0 0.4 233220 9968 ? S 17:08 0:00 \_ /usr/bin/python /usr/bin/salt-master
root 3029 0.1 0.5 1502184 10752 ? Sl 17:08 0:02 | \_ /usr/bin/python /usr/bin/salt-master
root 3030 0.2 0.7 335888 15464 ? Sl 17:08 0:03 | \_ /usr/bin/python /usr/bin/salt-master
root 3058 0.2 1.1 417836 24232 ? Sl 17:08 0:03 | \_ /usr/bin/python /usr/bin/salt-master
root 3068 0.3 1.1 491532 22876 ? Sl 17:08 0:05 | \_ /usr/bin/python /usr/bin/salt-master
root 3069 0.5 4.6 435448 93456 ? Sl 17:08 0:08 | \_ /usr/bin/python /usr/bin/salt-master
root 3072 0.4 4.6 585900 92952 ? Sl 17:08 0:07 | \_ /usr/bin/python /usr/bin/salt-master
root 3073 0.5 4.6 586700 94648 ? Sl 17:08 0:08 | \_ /usr/bin/python /usr/bin/salt-master
root 3076 0.4 3.8 355388 78092 ? Sl 17:08 0:07 | \_ /usr/bin/python /usr/bin/salt-master
root 3077 0.3 3.1 583704 64008 ? Sl 17:08 0:06 | \_ /usr/bin/python /usr/bin/salt-master
root 3082 0.6 2.8 511164 58452 ? Sl 17:08 0:10 | \_ /usr/bin/python /usr/bin/salt-master
root 3085 0.8 2.5 506484 51580 ? Sl 17:08 0:13 | \_ /usr/bin/python /usr/bin/salt-master
root 3088 0.8 3.2 583216 65412 ? Sl 17:08 0:14 | \_ /usr/bin/python /usr/bin/salt-master
root 3089 0.4 3.9 352460 80648 ? Sl 17:08 0:07 | \_ /usr/bin/python /usr/bin/salt-master
root 3097 0.4 4.5 369208 91008 ? Sl 17:08 0:08 | \_ /usr/bin/python /usr/bin/salt-master
root 3099 0.3 1.9 357276 38736 ? Sl 17:08 0:05 | \_ /usr/bin/python /usr/bin/salt-master
root 3103 0.2 3.8 433644 78236 ? Sl 17:08 0:04 | \_ /usr/bin/python /usr/bin/salt-master
root 3105 0.5 3.3 434828 67468 ? Sl 17:08 0:09 | \_ /usr/bin/python /usr/bin/salt-master
root 3112 0.4 2.7 581964 54752 ? Sl 17:08 0:07 | \_ /usr/bin/python /usr/bin/salt-master
root 3114 0.4 3.8 587592 77348 ? Sl 17:08 0:07 | \_ /usr/bin/python /usr/bin/salt-master
root 3122 0.5 3.1 507372 64088 ? Sl 17:08 0:09 | \_ /usr/bin/python /usr/bin/salt-master
root 3128 0.3 4.4 432040 89504 ? Sl 17:08 0:06 | \_ /usr/bin/python /usr/bin/salt-master
root 3133 0.1 1.0 478376 21172 ? Sl 17:08 0:01 | \_ /usr/bin/python /usr/bin/salt-master
root 3138 0.1 1.1 412860 22628 ? Sl 17:08 0:02 | \_ /usr/bin/python /usr/bin/salt-master
root 3144 0.1 1.2 478416 26028 ? Sl 17:08 0:02 | \_ /usr/bin/python /usr/bin/salt-master
root 3151 0.2 3.8 339932 77708 ? Sl 17:08 0:03 | \_ /usr/bin/python /usr/bin/salt-master
root 3155 0.1 1.2 412920 25012 ? Sl 17:08 0:01 | \_ /usr/bin/python /usr/bin/salt-master
root 3160 0.1 1.3 478476 27880 ? Sl 17:08 0:01 | \_ /usr/bin/python /usr/bin/salt-master
root 3164 0.2 1.0 420896 20548 ? Sl 17:08 0:04 | \_ /usr/bin/python /usr/bin/salt-master
root 3170 0.2 1.6 575240 33308 ? Sl 17:08 0:04 | \_ /usr/bin/python /usr/bin/salt-master
root 3177 0.3 4.6 434924 94184 ? Sl 17:08 0:05 | \_ /usr/bin/python /usr/bin/salt-master
root 3182 0.4 4.9 587388 98992 ? Sl 17:08 0:08 | \_ /usr/bin/python /usr/bin/salt-master
root 3185 0.2 1.1 484464 23884 ? Sl 17:08 0:03 | \_ /usr/bin/python /usr/bin/salt-master
root 3192 0.3 0.7 344700 16140 ? Sl 17:08 0:05 | \_ /usr/bin/python /usr/bin/salt-master
root 3197 0.3 2.8 429820 56896 ? Sl 17:08 0:05 | \_ /usr/bin/python /usr/bin/salt-master
root 3202 0.2 1.2 422060 25488 ? Sl 17:08 0:04 | \_ /usr/bin/python /usr/bin/salt-master
root 3027 0.1 0.5 306292 10900 ? Sl 17:08 0:02 \_ /usr/bin/python /usr/bin/salt-master
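To put a number on the listing above, the RSS column of the salt-master rows can be added up. This is a rough sketch only: the function name is hypothetical, it assumes the standard ps aux column order, and summed RSS counts pages shared between forked processes more than once, so it overstates real usage.

```python
def salt_master_rss_kib(ps_output):
    """Return (process_count, total_rss_kib) for salt-master rows of ps aux."""
    count = 0
    total = 0
    for line in ps_output.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[5].isdigit():
            continue  # header or malformed row
        if "salt-master" in fields[10]:
            count += 1
            total += int(fields[5])
    return count, total


# Three rows copied from the listing above, plus the header.
SAMPLE = r"""USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 10546 0.0 2.4 645888 48692 ? Sl Jun03 11:52 \_ /usr/bin/python /usr/bin/salt-minion
root 3008 0.0 0.4 232304 9844 ? Ss 17:07 0:00 /usr/bin/python /usr/bin/salt-master
root 3017 0.0 0.1 165544 3828 ? S 17:08 0:00 \_ /usr/bin/python /usr/bin/salt-master
"""

print(salt_master_rss_kib(SAMPLE))  # (2, 13672)
```

Feeding it the full ps auxf output instead of SAMPLE gives the total for every salt-master process on the box.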
That’s probably more salt-master processes than required, but so long as there is enough swap it shouldn’t be a problem, e.g.:
debian@ip-172-31-16-10:~$ free -m -t
total used free shared buff/cache available
Mem: 1972 1729 114 2 128 103
Swap: 2047 591 1456
Total: 4020 2320 1570
debian@ip-172-31-16-10:~$
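For tracking this headroom over time, the free output can be parsed into numbers. The following is a small sketch under the assumption that the first three values on each labelled row are total, used and free in MiB, as in the output above; the function name parse_free is made up for illustration.

```python
def parse_free(text):
    """Parse free -m -t rows into {label: {"total", "used", "free"}} (MiB)."""
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        values = rest.split()
        # Skip the header, shell prompts and anything without three numbers.
        if not sep or len(values) < 3:
            continue
        if not all(v.isdigit() for v in values[:3]):
            continue
        total, used, free = (int(v) for v in values[:3])
        rows[label.strip()] = {"total": total, "used": used, "free": free}
    return rows


# Values copied from the output above.
SAMPLE = """\
debian@ip-172-31-16-10:~$ free -m -t
total used free shared buff/cache available
Mem: 1972 1729 114 2 128 103
Swap: 2047 591 1456
Total: 4020 2320 1570
"""

rows = parse_free(SAMPLE)
print(rows["Swap"]["free"])  # 1456
```

With the snapshot above, 591 of 2047 MiB of swap was already in use under normal load.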
Steps to Reproduce the behavior
No idea. This looks very similar to the stack trace in https://github.com/saltstack/salt/issues/53261 which might offer some clues.
Expected behavior
I would not expect it to suddenly use so much RAM, particularly when just running test.ping. There was nothing left for Salt to do when this happened, as all hosts had already been accounted for in the test.ping output, yet the command stalled. Even when running sudo salt '*' state.highstate, I have never seen it run out of RAM before. Normally there is at least 1 GB of free swap space remaining. There are no other matches for “Cannot allocate” when searching through all master and minion logs on the same host. As per the above, there is nothing else of note running on the instance.
Salt Version:
Salt: 2019.2.5
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.5.3
docker-py: Not Installed
gitdb: 2.0.0
gitpython: 2.1.1
ioflo: Not Installed
Jinja2: 2.8
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: 0.24.0
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.8
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: Not Installed
Python: 2.7.13 (default, Sep 26 2018, 18:42:22)
python-gnupg: 0.3.9
PyYAML: 3.12
PyZMQ: 16.0.2
RAET: Not Installed
smmap: 2.0.1
timelib: Not Installed
Tornado: 4.4.3
ZMQ: 4.2.1
System Versions:
dist: debian 9.12
locale: UTF-8
machine: x86_64
release: 4.9.0-12-amd64
system: Linux
version: debian 9.12
@boltronics for the time being, I’ll label this ticket as “blocked” with info needed so we know to be expecting some follow-up from ya 😄
I tried it on an AMI with 2 GB of RAM but couldn’t reproduce it.