Possible memory leak in NodeJS / Python services
Uptime checks for the production deployment of OnlineBoutique have been failing once every few weeks. Looking at the kubectl events timed with an uptime check failure:
38m Warning NodeSysctlChange node/gke-online-boutique-mast-default-pool-65a22575-azeq {"unmanaged": {"net.ipv4.tcp_fastopen_key": "004baa97-3c3b554d-9bbcccf8-870ced36"}}
43m Warning NodeSysctlChange node/gke-online-boutique-mast-default-pool-65a22575-i6m8 {"unmanaged": {"net.ipv4.tcp_fastopen_key": "706b7d5f-9df4b412-e8eb875e-179c4765"}}
46m Warning NodeSysctlChange node/gke-online-boutique-mast-default-pool-65a22575-jvwz {"unmanaged": {"net.ipv4.tcp_fastopen_key": "a0f734c5-5c9a56e1-06aeb420-0010498e"}}
39m Warning OOMKilling node/gke-online-boutique-mast-default-pool-65a22575-jvwz Memory cgroup out of memory: Kill process 569290 (node) score 2181 or sacrifice child
Killed process 569290 (node) total-vm:1418236kB, anon-rss:121284kB, file-rss:33236kB, shmem-rss:0kB
39m Warning OOMKilling node/gke-online-boutique-mast-default-pool-65a22575-jvwz Memory cgroup out of memory: Kill process 2592522 (grpc_health_pro) score 1029 or sacrifice child
Killed process 2592530 (grpc_health_pro) total-vm:710956kB, anon-rss:1348kB, file-rss:7376kB, shmem-rss:0kB
It looks like the pods' memory usage is exceeding their limits. There seems to be plenty of allocatable memory across the prod GKE nodes, but as observed by @bourgeoisor, three of the workloads use steadily increasing amounts of memory until their pods are killed by GKE.
Currency and payment (NodeJS): [memory usage graphs showing steady growth omitted]
Recommendation (Python): [memory usage graph showing steady growth omitted]
TODO - investigate possible memory leaks, starting with the NodeJS services. Investigate why these services use an increasing amount of memory over time rather than a roughly constant amount. Then investigate the Python services and check whether other Python services (emailservice, for instance) show the same behavior as the recommendation service.
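One low-effort way to confirm the leak before digging into profiles is to log process memory from inside each suspect NodeJS service and watch whether heap/RSS only ever grows while traffic stays flat. Below is a minimal sketch (TypeScript, using only Node's built-in process.memoryUsage()); the 60-second interval and the log format are illustrative assumptions, not part of the original issue.

```typescript
// memory-sampler.ts — periodically log process memory so a leak shows up
// as a monotonically growing heapUsed/rss in the service's logs.
// Sketch only: the interval and output format are arbitrary choices.

const SAMPLE_INTERVAL_MS = 60_000;

function logMemoryUsage(): void {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  const toMiB = (bytes: number) => (bytes / 1024 / 1024).toFixed(1);
  console.log(
    `[mem] rss=${toMiB(rss)}MiB heapTotal=${toMiB(heapTotal)}MiB ` +
      `heapUsed=${toMiB(heapUsed)}MiB external=${toMiB(external)}MiB`
  );
}

// unref() so the timer does not keep the process alive during shutdown.
setInterval(logMemoryUsage, SAMPLE_INTERVAL_MS).unref();
```

If heapUsed keeps climbing under steady load, comparing heap snapshots (e.g. via v8.writeHeapSnapshot() or Chrome DevTools) is the usual next step to identify which objects are being retained.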
Top GitHub Comments
According to the profiler data for the currencyservice and paymentservice, the request-retry package is the one that seems to be using a lot of memory. It is imported by the google-cloud/common library, which is used by google-cloud/tracing, google-cloud/debug and google-cloud/profiler. The same behaviour is reported in the google-cloud/debug Node.js repository. As per this recent comment, the issue seems to have been eradicated after disabling google-cloud/debug.
I have created four PRs to stage 4 clusters with different settings to observe how the memory usage evolves over time:
- google-cloud/debug
- google-cloud/trace
- google-cloud/profiler
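For reference, the usual way to toggle these agents per cluster is to gate their start() calls behind environment variables set in each deployment. This is a hedged sketch of such a toggle, not the exact change in those PRs; the variable names (ENABLE_TRACING, ENABLE_DEBUGGING, ENABLE_PROFILER) and the service name are illustrative assumptions, while the start() calls follow the packages' documented entry points.

```typescript
// tracing-setup.ts — conditionally start the Cloud Ops agents so each
// staging cluster can run with a different combination enabled.
// Sketch only: env var names and options shown here are assumptions.

// require() is used so an agent is not even loaded when it is disabled.
if (process.env.ENABLE_TRACING === '1') {
  require('@google-cloud/trace-agent').start();
}
if (process.env.ENABLE_DEBUGGING === '1') {
  require('@google-cloud/debug-agent').start({ allowExpressions: true });
}
if (process.env.ENABLE_PROFILER === '1') {
  require('@google-cloud/profiler').start({
    serviceContext: { service: 'currencyservice', version: '1.0.0' },
  });
}
```

Running one cluster with only google-cloud/debug disabled should make it clear whether the upstream cloud-debug-nodejs issue accounts for the whole leak or only part of it.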
Hello @NimJay,
There isn’t much we can do from our side. I have been in touch with Ben and am seeing if we can work with the debug team to get that issue (https://github.com/googleapis/cloud-debug-nodejs/issues/811) fixed. Until then, no action is needed or possible on our side. I suggest we keep this issue open!