Liveness probe performance issue

Description of the issue

I deployed an ERPNext Helm chart on an in-house K8S cluster. Recently, after upgrading to a newer chart (from 1.0.0 to 1.0.14), I’ve been noticing significant, constant CPU usage. It turned out to be caused by the liveness probe execution for the scheduler and all workers, i.e. the doctor.py script. It’s not that critical when the server is idle, but under heavy load it has a significant performance impact.

The first point is that exactly the same script is executed for multiple containers, and it doesn’t check whether a given container is live, but whether all backend services are live. The second is that it’s a Python script, which is inherently heavyweight for this purpose. That is, every 5s, 4 instances of docker-entrypoint.sh are executed; each runs su (which also spams the syslog), initializes the Python environment, loads the Python interpreter with all the required libraries, and executes a script which, as far as I understand from briefly reading the code, basically does a TCP liveness check.

Context information (for bug reports)

Kubernetes cluster (v1.18.8) deployed on a dedicated server running Fedora 33, with erpnext-nginx v12.10.1, erpnext-worker v12.10.1, socketio v12.8.4, and Helm chart v1.0.14.

Steps to reproduce the issue

Deploy the ERPNext Helm chart.
Watch CPU usage as doctor.py is executed every 5s for the scheduler and all 3 workers.

Observed result

High CPU usage every 5s on a task that should be as optimised as possible (i.e. the liveness probe).

Expected result

No significant CPU load during liveness check.

Stacktrace / full error message if available

Not relevant.

Top GitHub Comments

1 reaction

revant commented, Aug 26, 2020

No netcat is used in frappe-nginx/docker-entrypoint.sh; it waits for a port with plain bash:

timeout 3 bash -c 'until printf "" 2>>/dev/null >>/dev/tcp/$0/$1; do sleep 1; done' $PING_HOST $PING_PORT
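The one-liner above keeps retrying for up to 3 seconds until the port accepts a connection. For a liveness probe, where the kubelet already handles retries, a one-shot variant may be enough. The following is only a sketch under that assumption: `tcp_check` is a hypothetical helper name, not something that exists in frappe_docker, and it relies on a bash built with `/dev/tcp` support and on the coreutils `timeout` command.

```shell
#!/usr/bin/env bash
# Sketch of a one-shot TCP check. tcp_check is a hypothetical helper,
# not part of frappe_docker or the Helm chart.
tcp_check() {
  local host="$1" port="$2" seconds="${3:-3}"
  # The inner bash opens a TCP connection by redirecting file descriptor 3
  # to /dev/tcp/HOST/PORT. A failed connect makes the inner shell exit
  # non-zero; timeout kills it if the connect hangs longer than $seconds.
  timeout "$seconds" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

Used as an exec probe command, a non-zero exit status from `tcp_check` would mark the probe as failed, and no Python interpreter is started.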

concerns:

What if the config changes for any reason and the healthcheck keeps using /tmp/doctor.sh?
Reading JSON without jq: nodejs can read and parse JSON. It’s faster than Python, I’m sure.
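One possible guard for the first concern is to regenerate the cached script whenever the config file is newer than it. This is a sketch only: it assumes /tmp/doctor.sh is a script generated from a JSON config file, and both the config path and the `cache_is_stale` helper are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical paths; the config location is an assumption for illustration.
CONFIG_FILE="${CONFIG_FILE:-sites/common_site_config.json}"
CACHE_FILE="${CACHE_FILE:-/tmp/doctor.sh}"

# Succeeds (exit 0) when the cached script is missing or older than the
# config file, i.e. when it should be regenerated before use.
cache_is_stale() {
  local config="$1" cache="$2"
  [ ! -f "$cache" ] || [ "$config" -nt "$cache" ]
}
```

The healthcheck would call `cache_is_stale "$CONFIG_FILE" "$CACHE_FILE"` first and rebuild the cached script only when it succeeds, which costs a couple of stat calls per probe. It compares modification times only, so it will not notice a config restored with an older timestamp.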

1 reaction

zapsos commented, Aug 25, 2020

@MarekPikula Thanks, this is great from a learning and debugging point of view. I am new to Kubernetes and the Frappe Helm charts, so after you posted this bug I started learning about liveness and readiness probes in order to understand the charts. I hope we find a solution so the learning continues!
