agent container: high load and memory issue with many addresses
After adding 6000 addresses, the agent container constantly shows high load and growing memory usage.
Test setup: The test was done on minikube using EnMasse master (last commit from 2018-03-16). Extra memoryUsage log output (with a preceding gc() invocation) was added to the agent at a 10s interval. Using a script, 3000 pooled-queue and 3000 standard-anycast addresses were created by directly adding the corresponding K8s ConfigMaps. To avoid creating too many router/broker instances, the credit settings in plan.yaml were reduced so that the created addresses ended up on 2 brokers and 4 routers. The admin container was configured with a memory setting of "2Gi", and the readiness/liveness check timeout and failureThreshold were increased.
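For reference, a minimal sketch of the kind of periodic memory logging that was added for this test (assuming the node process is started with --expose-gc so that global.gc() is available; the log helper below only stands in for the agent's own logger):

```js
// Minimal sketch of the 10s memory logging used for this test.
// Assumes node was started with --expose-gc; 'log' stands in for the agent's logger.
const log = (msg) => console.log(new Date().toISOString(), 'agent info', msg);

setInterval(() => {
    if (global.gc) global.gc();  // force a collection before sampling
    log('memoryUsage: ' + JSON.stringify(process.memoryUsage()));
}, 10000);
```

process.memoryUsage() reports rss, heapTotal, heapUsed and external in bytes, which is the format seen in the log lines below.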
Outcome: Load/CPU usage of the agent container after all addresses are active:
top - 19:16:14 up 1:04, 0 users, load average: 11.22, 10.08, 7.89
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 1990632 1.044g 17880 R 86.1 10.7 13:20.65 node
top - 19:16:15 up 1:04, 0 users, load average: 11.28, 10.11, 7.91
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 2004264 1.057g 17880 R 60.0 10.8 13:21.25 node
top - 19:16:16 up 1:04, 0 users, load average: 11.28, 10.11, 7.91
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 2028936 1.080g 17880 R 79.2 11.1 13:22.05 node
top - 19:16:17 up 1:04, 0 users, load average: 11.28, 10.11, 7.91
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 2039496 1.090g 17880 R 76.0 11.2 13:22.81 node
top - 19:16:18 up 1:04, 0 users, load average: 11.28, 10.11, 7.91
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 2050056 1.101g 17880 R 88.0 11.3 13:23.69 node
top - 19:16:19 up 1:04, 0 users, load average: 11.28, 10.11, 7.91
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 2063576 1.113g 17880 R 98.0 11.4 13:24.67 node
top - 19:16:20 up 1:04, 0 users, load average: 10.46, 9.96, 7.88
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 2076248 1.125g 17880 R 93.1 11.5 13:25.61 node
Memory usage in the agent container keeps growing (all addresses active here):
2018-03-21T19:11:33.326Z agent info memoryUsage: {"rss":974401536,"heapTotal":861790208,"heapUsed":706318016,"external":25677989}
2018-03-21T19:11:44.273Z agent info memoryUsage: {"rss":970465280,"heapTotal":863887360,"heapUsed":700740512,"external":25026009}
2018-03-21T19:11:55.245Z agent info memoryUsage: {"rss":993488896,"heapTotal":882761728,"heapUsed":710079856,"external":27841162}
2018-03-21T19:12:06.005Z agent info memoryUsage: {"rss":1048915968,"heapTotal":893153280,"heapUsed":736778840,"external":24560837}
2018-03-21T19:12:17.149Z agent info memoryUsage: {"rss":1032990720,"heapTotal":920416256,"heapUsed":751109352,"external":27509007}
2018-03-21T19:12:28.139Z agent info memoryUsage: {"rss":1002680320,"heapTotal":896299008,"heapUsed":746150800,"external":21346512}
2018-03-21T19:12:39.276Z agent info memoryUsage: {"rss":1010847744,"heapTotal":905736192,"heapUsed":737315864,"external":25870521}
2018-03-21T19:12:50.029Z agent info memoryUsage: {"rss":1188294656,"heapTotal":920387584,"heapUsed":753642904,"external":27052203}
2018-03-21T19:13:01.348Z agent info memoryUsage: {"rss":1037041664,"heapTotal":931921920,"heapUsed":777528632,"external":32383563}
2018-03-21T19:13:12.594Z agent info memoryUsage: {"rss":1155084288,"heapTotal":934019072,"heapUsed":773970176,"external":23823693}
2018-03-21T19:13:23.916Z agent info memoryUsage: {"rss":1048711168,"heapTotal":942407680,"heapUsed":785318568,"external":24669724}
2018-03-21T19:13:35.082Z agent info memoryUsage: {"rss":1167368192,"heapTotal":941359104,"heapUsed":779608152,"external":22198545}
2018-03-21T19:13:46.762Z agent info memoryUsage: {"rss":1053048832,"heapTotal":947650560,"heapUsed":795987432,"external":33385782}
2018-03-21T19:13:57.756Z agent info memoryUsage: {"rss":1187852288,"heapTotal":954990592,"heapUsed":791434968,"external":21823205}
2018-03-21T19:14:08.970Z agent info memoryUsage: {"rss":1063333888,"heapTotal":959184896,"heapUsed":782595680,"external":26914455}
2018-03-21T19:14:20.293Z agent info memoryUsage: {"rss":1226559488,"heapTotal":984350720,"heapUsed":822691040,"external":29027519}
2018-03-21T19:14:31.422Z agent info memoryUsage: {"rss":1179160576,"heapTotal":972816384,"heapUsed":824286960,"external":20654142}
2018-03-21T19:14:42.673Z agent info memoryUsage: {"rss":1082613760,"heapTotal":978059264,"heapUsed":820064736,"external":23986994}
2018-03-21T19:14:53.794Z agent info memoryUsage: {"rss":1090334720,"heapTotal":982253568,"heapUsed":817118512,"external":31199966}
2018-03-21T19:15:05.195Z agent info memoryUsage: {"rss":1110749184,"heapTotal":1002176512,"heapUsed":831922624,"external":27058201}
2018-03-21T19:15:16.391Z agent info memoryUsage: {"rss":1106104320,"heapTotal":995885056,"heapUsed":827183080,"external":24427066}
2018-03-21T19:15:28.724Z agent info memoryUsage: {"rss":1107951616,"heapTotal":997994496,"heapUsed":852140232,"external":21534674}
2018-03-21T19:15:40.366Z agent info memoryUsage: {"rss":1249570816,"heapTotal":1002188800,"heapUsed":847199896,"external":23825987}
2018-03-21T19:15:51.997Z agent info memoryUsage: {"rss":1116422144,"heapTotal":1006383104,"heapUsed":859315112,"external":24426907}
2018-03-21T19:16:03.709Z agent info memoryUsage: {"rss":1120550912,"heapTotal":1010577408,"heapUsed":854869616,"external":24401762}
2018-03-21T19:16:14.735Z agent info memoryUsage: {"rss":1255944192,"heapTotal":1010577408,"heapUsed":850406296,"external":23955893}
2018-03-21T19:16:25.702Z agent info memoryUsage: {"rss":1250340864,"heapTotal":1016868864,"heapUsed":876385064,"external":23322844}
2018-03-21T19:16:36.704Z agent info memoryUsage: {"rss":1278078976,"heapTotal":1029451776,"heapUsed":868400384,"external":25833605}
Note that even with the liveness timeout set to 3s, the failureThreshold to 10 and the period to 30s for the agent container, it still restarted once because of failed liveness checks.
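For reference, those probe settings correspond to a container spec fragment roughly like the following (values from this test; the probe handler itself is omitted because it is deployment-specific):

```yaml
# Illustrative livenessProbe settings used for this test; the probe handler
# (exec/httpGet) is omitted as it depends on the agent deployment.
livenessProbe:
  timeoutSeconds: 3
  failureThreshold: 10
  periodSeconds: 30
```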
See also the complete logs: logs.zip
Top GitHub Comments
Thanks! Looking at the K8s dashboard, CPU usage of the admin pod is down from 0.9 to 0.04 (with 1000 addresses).
Quick update: I have made some improvements, particularly for steady state, and am finishing off some further changes that will hopefully improve the responsiveness at the point where the 6k addresses are created. I did not quite get this completed yet and am off next week, but will be pursuing this and related issues as top priority on my return on the 16th.