Operator is leaking files in /tmp, running out of disk space
Describe the bug
After running fine for several weeks I noticed that the operator stopped processing new CRDs. When I checked the logs, it printed something about /tmp and the disk being full. When I started a shell in the operator pod and looked at /tmp/, it was full of files and directories, and the tmpfs indeed had 0% free space.
I upgraded the operator to the most recent version, which of course restarted it and cleared /tmp. Today I checked again, and new files have already started to pile up, so it seems the bug is still there.
[strimzi@strimzi-cluster-operator-7845cc6994-8kwt7 strimzi]$ ls -al /tmp/
total 4
drwxrwxrwt 9 root root 180 Jun 9 11:04 .
drwxr-xr-x 1 root root 4096 Jun 8 22:04 ..
drwxr-xr-x 2 strimzi root 80 Jun 8 22:04 hsperfdata_strimzi
drwx------ 2 strimzi root 40 Jun 7 01:20 vertx-cache-10d589cd-960b-4b7d-acb8-004c370bbba6
drwx------ 2 strimzi root 40 Jun 3 07:47 vertx-cache-3c368a93-8142-4705-88e8-41e4e61fdfb9
drwx------ 2 strimzi root 40 Jun 1 11:03 vertx-cache-67e1066c-d66f-43e8-9920-b0748d8fa718
drwx------ 2 strimzi root 40 Jun 8 22:04 vertx-cache-8e23f4b7-a71d-4cdf-9e78-f3b401f9ba16
drwx------ 2 strimzi root 40 May 30 08:07 vertx-cache-9316e7c6-33d3-4152-afcd-d0af00aaf7cc
drwx------ 2 strimzi root 40 Jun 5 04:34 vertx-cache-cf4d0cfc-cde3-4178-9d84-7936be517402
[strimzi@strimzi-cluster-operator-7845cc6994-8kwt7 strimzi]$ df /tmp/
Filesystem 1K-blocks Used Available Use% Mounted on
tmpfs 1024 160 864 16% /tmp
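For reference, the shell used above can be obtained with something like the following (the pod name is taken from the listing above; the namespace and the availability of bash in the image are assumptions):

kubectl exec -it strimzi-cluster-operator-7845cc6994-8kwt7 -n <namespace> -- /bin/bash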
To Reproduce
Steps to reproduce the behavior:
Install operator. Let it run for several weeks or months without restarting the pod.
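A possibly quicker way to trigger it (assuming, as the maintainer comments below explain, that the cache directories accumulate on container restarts) is to force the operator container to restart without deleting the Pod, for example by terminating its main process so the kubelet restarts the container in place. This is only an illustration; pod name and namespace are from this cluster and need adjusting:

kubectl exec -n <namespace> strimzi-cluster-operator-7845cc6994-8kwt7 -- sh -c 'kill 1'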
Expected behavior
Files in /tmp should be deleted when no longer used. /tmp/ should never run out of disk space.
Environment (please complete the following information):
- Strimzi version: 0.29.0
- Installation method: Helm chart 0.29.0 from https://strimzi.io/charts
- Kubernetes cluster: k8s 1.22
- Infrastructure: on premise
Top GitHub Comments
@elluvium Normally, the /tmp directory should have only one of the vertx-cache-* directories. But when the container restarts inside the Pod, it normally keeps the same storage, while Vert.x creates a new cache directory. So with every container restart, another cache directory is created. Their content should be small, but in the end it is only a question of time until they use up all the space. Only when you delete the Pod is the /tmp storage deleted as well and the container starts with a clean slate. So the problem has two possible solutions: avoid restarting the container, or clean up the leftover directories in /tmp from the previous run. By deleting the cache directories at startup if any exist, you can avoid running out of disk space. This is obviously not as perfect as not restarting it, but it should be easy to do and should work for all kinds of different situations.
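For illustration, a startup cleanup along the lines of the second option could look something like this (just a sketch, not the operator's actual fix; the directory pattern is taken from the listing above):

# remove Vert.x cache directories left over from a previous container run,
# then continue with the normal operator startup
find /tmp -maxdepth 1 -type d -name 'vertx-cache-*' -exec rm -rf {} +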
@elluvium AFAIK, the containers might get restarted individually if you have more of them inside the Pod. What I'm trying to distinguish is the restart when a container exits and is started again (which I guess you could call a Pod restart as well) and the old Pod being deleted and a new Pod being created by the Deployment / Replica Set.
Basically, if you do something like kubectl get pods -o wide, you should see the Pods of your Deployment, and the RESTARTS column shows the restarts of the containers (or of the Pod if you want).
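For illustration, such output might look roughly like this (pod name, counts and columns are made up and abbreviated):

NAME                                        READY   STATUS    RESTARTS   AGE   IP           NODE
strimzi-cluster-operator-7845cc6994-8kwt7   1/1     Running   3          42d   10.42.0.17   worker-1

A RESTARTS value greater than 0 with the same Pod name means the container was restarted in place, which is the case where the old vertx-cache-* directories stay around.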