Celery Worker docker healthcheck causes a memory leak
Apache Airflow version
2.2.3 (latest released)
What happened
With a Docker setup as defined by this compose file, the airflow-worker service's healthcheck.test command causes a steady increase in memory use over time, even when the worker is idle. This was observed with Airflow 2.1.4 and 2.2.3.
We observed this in our AWS ECS cluster, where each worker runs with 0.5 CPU and 1 GB of memory. A task unexpectedly failed (at the second dip in memory use in the picture below), which prompted further investigation. The task had actually succeeded but, for some reason, notified its dependent tasks that it had failed. Subsequent tasks were marked as upstream failed, while the webserver reported the task as a success. We noticed the metrics page looked like the image below.
We raised the CPU & Memory to 2 CPU / 4 GB Mem and restarted the service, which still produced a gradual increase in memory.
What you expected to happen
Memory use should not grow while the system is idle. It should spike during the healthcheck and then be released back to the host.
How to reproduce
We use a modified version of the compose file and instead favor docker stack, but the same setup should apply to the documented compose file. A slimmed down compose file is below. It has 2 workers, one with a healthcheck and one without.
A secondary script was written to scrape the docker statistics in 10 second intervals and write them to a CSV file.
Executing both commands can be done like so:
$ docker stack deploy -c docker-compose.yaml airflow
$ nohup ./collect_stats.sh > stats.csv &
The necessary files are below. I’m also including a sample of the CSV file run locally. worker_stats.csv
It shows the steady memory increase of airflow_worker_healthcheck over a ~2 hour period. The worker consumes an additional ~45 MB per hour when the healthcheck runs at 10 second intervals.
Date | Container | CPU Percent | Mem Usage | Mem Percent |
2022-01-21T19:17:57UTC | airflow_worker_no_healthcheck.1.z3lwru6mh22tzpe6dc8h8ekvi | 0.47% | 1.108GiB / 14.91GiB | 7.43% |
2022-01-21T19:17:57UTC | airflow_worker_healthcheck.1.vw9d1q18dx75v7w3zcfb7fypt | 0.57% | 1.1GiB / 14.91GiB | 7.38% |
2022-01-21T20:34:01UTC | airflow_worker_no_healthcheck.1.z3lwru6mh22tzpe6dc8h8ekvi | 0.28% | 1.108GiB / 14.91GiB | 7.43% |
2022-01-21T20:34:01UTC | airflow_worker_healthcheck.1.vw9d1q18dx75v7w3zcfb7fypt | 0.76% | 1.157GiB / 14.91GiB | 7.76% |
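As a rough cross-check of the ~45 MB per hour figure, the growth rate can be worked out from the two airflow_worker_healthcheck sample rows above (1.1GiB at 19:17:57 and 1.157GiB at 20:34:01). This is only a sketch: the helper name growth_mib_per_hour is made up for illustration, and it assumes docker stats reports binary units (1 GiB = 1024 MiB).

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the original report).
# Usage: growth_mib_per_hour FROM_GIB TO_GIB ELAPSED_SECONDS
growth_mib_per_hour() {
    awk -v from_gib="$1" -v to_gib="$2" -v secs="$3" 'BEGIN {
        growth_mib = (to_gib - from_gib) * 1024    # GiB -> MiB
        printf "%.1f\n", growth_mib / (secs / 3600)
    }'
}

# 19:17:57 -> 20:34:01 is 1 h 16 min 4 s = 4564 seconds.
growth_mib_per_hour 1.1 1.157 4564
```

The result comes out at roughly 46 MiB per hour, which is in line with the figure quoted above.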
collect_stats.sh
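#!/usr/bin/env sh
# Print a CSV header, then append one row per worker container every 10 seconds.
echo "Date,Container,CPU Percent,Mem Usage,Mem Percent"
while true; do
    time=$(date --utc +%FT%T%Z)
    # Take a single snapshot, keep only the worker rows, and prefix the timestamp.
    docker stats \
        --format "table {{.Name}},{{.CPUPerc}},{{.MemUsage}},{{.MemPerc}}" \
        --no-stream \
        | grep worker \
        | awk -vT="${time}," '{ print T $0 }'
    sleep 10
done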
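To compare the two workers without reading the CSV by eye, the file written by collect_stats.sh can be reduced to one line per container. The following is a minimal sketch, not part of the original setup: summarize_stats is a hypothetical helper, it assumes the five-column layout produced by the script above, and it assumes the Mem Usage values carry the B, KiB, MiB, or GiB suffixes that docker stats normally prints.

```shell
#!/usr/bin/env sh
# Hypothetical helper: print "<container> <growth in MiB>" for each container,
# comparing its first and last sample in the CSV file given as $1.
summarize_stats() {
    awk -F, '
        # Convert a "1.157GiB / 14.91GiB" style field to MiB (usage part only).
        function to_mib(field,    value) {
            sub(/ *\/.*/, "", field)            # drop the " / limit" part
            value = field + 0                   # leading numeric prefix
            if (field ~ /GiB$/) return value * 1024
            if (field ~ /MiB$/) return value
            if (field ~ /KiB$/) return value / 1024
            return value / 1048576              # assumption: plain bytes ("B")
        }
        NR == 1 { next }                        # skip the CSV header line
        {
            mib = to_mib($4)
            if (!($2 in first)) { first[$2] = mib; order[++count] = $2 }
            last[$2] = mib
        }
        END {
            for (i = 1; i <= count; i++)
                printf "%s %.1f\n", order[i], last[order[i]] - first[order[i]]
        }
    ' "$1"
}
```

Calling `summarize_stats stats.csv` on a file holding only the four sample rows shown earlier should report about 58.4 MiB of growth for the healthcheck worker and 0.0 for the worker without one.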
docker-compose.yaml
---
version: '3.7'

networks:
  net:
    driver: overlay
    attachable: true

volumes:
  postgres-data:
  redis-data:

services:
  postgres:
    image: postgres:13.2-alpine
    volumes:
      - postgres-data:/var/lib/postgresql/data
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    healthcheck:
      test: pg_isready -U airflow -d airflow
      interval: 10s
      timeout: 3s
      start_period: 15s
    ports:
      - '5432:5432'
    networks:
      - net

  redis:
    image: redis:6.2
    volumes:
      - redis-data:/data
    healthcheck:
      test: redis-cli ping
      interval: 10s
      timeout: 3s
      start_period: 15s
    ports:
      - '6379:6379'
    networks:
      - net

  webserver:
    image: apache/airflow:2.2.3-python3.8
    command:
      - bash
      - -c
      - 'airflow db init
        && airflow db upgrade
        && airflow users create --username admin --firstname Admin --lastname User --password admin --role Admin --email test@admin.org
        && airflow webserver'
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: INFO
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    healthcheck:
      test: curl --fail http://localhost:8080/health
      interval: 10s
      timeout: 10s
      retries: 10
      start_period: 90s
    ports:
      - '8080:8080'
    networks:
      - net

  scheduler:
    image: apache/airflow:2.2.3-python3.8
    command: scheduler
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: INFO
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    healthcheck:
      test: airflow db check
      interval: 20s
      timeout: 10s
      retries: 5
      start_period: 40s
    networks:
      - net

  worker_healthcheck:
    image: apache/airflow:2.2.3-python3.8
    command: celery worker
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: DEBUG
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
      start_period: 40s
    networks:
      - net

  worker_no_healthcheck:
    image: apache/airflow:2.2.3-python3.8
    command: celery worker
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: DEBUG
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    networks:
      - net
Operating System
Ubuntu 20.04.3 LTS
Versions of Apache Airflow Providers
Using the base Python 3.8 Docker images from:
https://hub.docker.com/layers/apache/airflow/2.2.3-python3.8/images/sha256-a8c86724557a891104e91da8296157b4cabd73d81011ee1f733cbb7bbe61d374?context=explore
https://hub.docker.com/layers/apache/airflow/2.1.4-python3.8/images/sha256-d14244034721583a4a2d9760ffc9673307a56be5d8c248df02c466ca86704763?context=explore
Deployment
Docker-Compose is included above.
Deployment details
Tested with Python 3.8 images
Anything else
We did not see similar issues with the Webserver or Scheduler deployments.
My colleague and I think this might be related to underlying Celery memory leaks. He told me about an upcoming release that includes https://github.com/apache/airflow/pull/19703. I’d be interested to see whether a similar issue occurs with the newer version.
I don’t believe there’s much else that can be done on Airflow’s part here besides upgrading Celery. I just wanted to bring awareness to this outstanding issue. We are currently searching for a different healthcheck that potentially avoids Celery. If there are suggestions, I would gladly create a PR to update the documented compose file.
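One direction for a healthcheck that avoids Celery's control commands is a plain process check. The sketch below is an illustration under several assumptions rather than a tested replacement: process_running is a hypothetical helper, it relies on a Linux /proc filesystem, and the pattern that identifies the worker would need to be verified inside the actual container. It only shows that a matching process exists, not that the worker can reach the broker, so it is a weaker check than inspect ping.

```shell
#!/usr/bin/env sh
# Hypothetical liveness probe: succeed if any process command line matches.
# Usage: process_running PATTERN [PROC_ROOT]   (PROC_ROOT defaults to /proc)
process_running() {
    pattern="$1"
    proc_root="${2:-/proc}"
    for cmdline in "$proc_root"/[0-9]*/cmdline; do
        # Skip the probe shell itself so its own arguments cannot match.
        [ "$cmdline" = "$proc_root/$$/cmdline" ] && continue
        [ -r "$cmdline" ] || continue
        # /proc stores arguments NUL-separated; turn them into spaces first.
        if { tr '\0' ' ' < "$cmdline"; } 2>/dev/null | grep -q -e "$pattern"; then
            return 0
        fi
    done
    return 1
}
```

If the script were made available inside the worker image, a call such as `process_running "celery"` could stand in for the celery inspect ping command in healthcheck.test. Whether "celery" is the right pattern depends on how the worker process names itself, which is an assumption here.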
Other related issues may be: https://github.com/celery/celery/issues/4843 https://github.com/celery/kombu/pull/1470
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
I’ve posted a general Q&A discussion on the Celery repo: https://github.com/celery/celery/discussions/7327. Feedback so far suggests this is potentially related to https://github.com/celery/celery/issues/6009.
Some memory leaks still persist in Celery.