
Celery Worker docker healthcheck causes a memory leak


Apache Airflow version

2.2.3 (latest released)

What happened

With a docker setup as defined by this compose file, the airflow-worker service healthcheck.test command causes a general increase in memory use over time, even when the system is idle. This was observed with Airflow 2.1.4 and 2.2.3.

https://github.com/apache/airflow/blob/958860fcd7c9ecdf60b7ebeef4397b348835c8db/docs/apache-airflow/start/docker-compose.yaml#L131-L137
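
For reference, the healthcheck in question is the same one used for the worker_healthcheck service in the reproduction compose file further below; it is repeated here so the offending command is visible alongside the description:

    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
      start_period: 40s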

We observed this in our AWS ECS cluster, which uses a 0.5 CPU / 1 GB memory worker setup. Strangely, we had a task fail (at the second dip in memory use in the image below), which prompted further investigation. The task had actually succeeded, but for some reason it notified dependent tasks as failed; subsequent tasks were marked as upstream failures, while the webserver reported the task as a success. We noticed the metrics page looked like the image below.

[image]

We raised the CPU and memory to 2 CPU / 4 GB and restarted the service, which still produced a gradual increase in memory.

[image]

What you expected to happen

Memory use should not increase while the system is idle; instead, it should spike during the healthcheck and be released back to the host.

How to reproduce

We use a modified version of the compose file and favor docker stack instead, but the same setup should apply to the documented compose file. A slimmed-down compose file is below; it has two workers, one with a healthcheck and one without.

A secondary script scrapes the docker statistics at 10-second intervals and writes them to a CSV file.

Deploying the stack and starting the stats collection can be done like so:

$ docker stack deploy -c docker-compose.yaml airflow
$ nohup ./collect_stats.sh > stats.csv &

The necessary files are below. I’m also including a sample of the CSV file from a local run: worker_stats.csv

It shows a general increase in memory use for airflow_worker_healthcheck over a ~2 hour period: roughly 45 MB per hour with the healthcheck running at 10-second intervals. (A rough way to derive this figure from the CSV is sketched after the collection script below.)

Date,Container,CPU Percent,Mem Usage,Mem Percent
2022-01-21T19:17:57UTC,airflow_worker_no_healthcheck.1.z3lwru6mh22tzpe6dc8h8ekvi,0.47%,1.108GiB / 14.91GiB,7.43%
2022-01-21T19:17:57UTC,airflow_worker_healthcheck.1.vw9d1q18dx75v7w3zcfb7fypt,0.57%,1.1GiB / 14.91GiB,7.38%
2022-01-21T20:34:01UTC,airflow_worker_no_healthcheck.1.z3lwru6mh22tzpe6dc8h8ekvi,0.28%,1.108GiB / 14.91GiB,7.43%
2022-01-21T20:34:01UTC,airflow_worker_healthcheck.1.vw9d1q18dx75v7w3zcfb7fypt,0.76%,1.157GiB / 14.91GiB,7.76%

collect_stats.sh

#!/usr/bin/env sh

# Print a CSV header, then sample `docker stats` for the worker containers
# roughly every 10 seconds (each --no-stream call itself takes a moment).
echo "Date,Container,CPU Percent,Mem Usage,Mem Percent"
while true; do
    time=$(date --utc +%FT%T%Z)
    docker stats \
      --format "table {{.Name}},{{.CPUPerc}},{{.MemUsage}},{{.MemPerc}}" \
      --no-stream \
      | grep worker \
      | awk -vT="${time}," '{ print T $0 }'
    sleep 10
done
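
To put a number on the growth rate, here is a minimal sketch (not part of the original report) that compares the first and last memory samples for the healthcheck worker in worker_stats.csv. It assumes the CSV was produced by collect_stats.sh above, i.e. that samples are roughly 10 seconds apart:

# Hypothetical helper: estimate memory growth per hour for the healthcheck
# worker. CSV columns are Date,Container,CPU Percent,Mem Usage,Mem Percent.
grep 'airflow_worker_healthcheck' worker_stats.csv | awk -F',' '
  {
    # Field 4 looks like "1.1GiB / 14.91GiB"; adding 0 coerces the leading
    # number out of "1.1GiB".
    split($4, mem, " ")
    used = mem[1] + 0
    if (NR == 1) { first = used }
    last = used
    rows = NR
  }
  END {
    # Samples are roughly 10 seconds apart (see collect_stats.sh above).
    hours = rows * 10 / 3600
    growth = (last - first) * 1024
    printf "grew %.1f MiB over %.2f h (~%.1f MiB/h)\n", growth, hours, growth / hours
  }'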

docker-compose.yaml

---
version: '3.7'

networks:
  net:
    driver: overlay
    attachable: true

volumes:
  postgres-data:
  redis-data:

services:
  postgres:
    image: postgres:13.2-alpine
    volumes:
      - postgres-data:/var/lib/postgresql/data
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    healthcheck:
      test: pg_isready -U airflow -d airflow
      interval: 10s
      timeout: 3s
      start_period: 15s
    ports:
      - '5432:5432'
    networks:
      - net

  redis:
    image: redis:6.2
    volumes:
      - redis-data:/data
    healthcheck:
      test: redis-cli ping
      interval: 10s
      timeout: 3s
      start_period: 15s
    ports:
      - '6379:6379'
    networks:
      - net

  webserver:
    image: apache/airflow:2.2.3-python3.8
    command:
      - bash
      - -c
      - 'airflow db init
      && airflow db upgrade
      && airflow users create --username admin --firstname Admin --lastname User --password admin --role Admin --email test@admin.org
      && airflow webserver'
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: INFO
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    healthcheck:
      test: curl --fail http://localhost:8080/health
      interval: 10s
      timeout: 10s
      retries: 10
      start_period: 90s
    ports:
      - '8080:8080'
    networks:
      - net

  scheduler:
    image: apache/airflow:2.2.3-python3.8
    command: scheduler
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: INFO
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    healthcheck:
      test: airflow db check
      interval: 20s
      timeout: 10s
      retries: 5
      start_period: 40s
    networks:
      - net

  worker_healthcheck:
    image: apache/airflow:2.2.3-python3.8
    command: celery worker
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: DEBUG
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
      start_period: 40s
    networks:
      - net

  worker_no_healthcheck:
    image: apache/airflow:2.2.3-python3.8
    command: celery worker
    environment:
      AIRFLOW__API__AUTH_BACKEND: airflow.api.auth.backend.basic_auth
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/1
      AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__CORE__FERNET_KEY: yxfSDUw_7SG6BhBstIt7dFzL5rpnxvr_Jkv0tFyEJ3s=
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql://airflow:airflow@postgres:5432/airflow
      AIRFLOW__LOGGING__LOGGING_LEVEL: DEBUG
      AIRFLOW__WEBSERVER__SECRET_KEY: 0123456789
    networks:
      - net

Operating System

Ubuntu 20.04.3 LTS

Versions of Apache Airflow Providers

Using the base Python 3.8 Docker images:

https://hub.docker.com/layers/apache/airflow/2.2.3-python3.8/images/sha256-a8c86724557a891104e91da8296157b4cabd73d81011ee1f733cbb7bbe61d374?context=explore
https://hub.docker.com/layers/apache/airflow/2.1.4-python3.8/images/sha256-d14244034721583a4a2d9760ffc9673307a56be5d8c248df02c466ca86704763?context=explore

Deployment

Docker-Compose is included above.

Deployment details

Tested with Python 3.8 images

Anything else

We did not see similar issues with the Webserver or Scheduler deployments.

My colleague and I think this might be related to underlying Celery memory leaks. He has informed me of an upcoming release that includes https://github.com/apache/airflow/pull/19703. I’d be interested to see whether a similar issue occurs with the newer version.

I don’t believe there’s much else that can be done on Airflow’s part here besides upgrading Celery; I just wanted to bring awareness to this outstanding issue. We are currently looking for a different healthcheck that avoids Celery entirely (one rough idea is sketched below). If there are suggestions, I would gladly create a PR to update the documented compose file.
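
As an illustration only (we have not tested or validated this), a healthcheck that avoids spawning a Celery control command on every probe could simply check that a worker process is alive, assuming pgrep (procps) is available in the image:

    # Untested sketch: verify a Celery worker process exists instead of issuing
    # "celery inspect ping" on every probe. This is a weaker guarantee: it does
    # not prove the worker can still reach the broker or result backend.
    healthcheck:
      test: ["CMD-SHELL", "pgrep -f 'celery worker' > /dev/null"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 40s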

Other related issues may be:

  • https://github.com/celery/celery/issues/4843
  • https://github.com/celery/kombu/pull/1470

Are you willing to submit PR?

  • Yes I am willing to submit a PR!


Top GitHub Comments

2 reactions
mtraynham commented, Mar 1, 2022

I’ve posted a general Q&A discussion on the Celery repo: https://github.com/celery/celery/discussions/7327. General feedback has suggested this is potentially related to https://github.com/celery/celery/issues/6009.

0 reactions
auvipy commented, Mar 6, 2022

Still, some memory leaks persist in Celery.
