
TaskInstances do not succeed when using enable_logging=True option in DockerSwarmOperator

See original GitHub issue

Apache Airflow version: v2.0.0 Git Version: release:2.0.0+ab5f770bfcd8c690cbe4d0825896325aca0beeca

Docker version: Docker version 20.10.1, build 831ebeae96

Environment:

  • Cloud provider or hardware configuration: local setup, docker engine in swarm mode, docker stack deploy
  • OS (e.g. from /etc/os-release): Manjaro Linux
  • Kernel (e.g. uname -a): 5.9.11
  • Install tools:
    • docker airflow image apache/airflow:2.0.0-python3.8 (hash fe4a64af9553)
  • Others:

What happened:

When using DockerSwarmOperator (from either the contrib or the providers module) together with the default enable_logging=True option, tasks do not succeed and stay in the running state. When checking the docker service logs I can clearly see that the container ran and ended successfully. Airflow, however, does not recognize that the container finished and keeps the tasks in the running state.
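
To confirm this outside Airflow, one can inspect the states of the service's tasks with the docker Python SDK; the following is a minimal sketch (not part of the original report), run on the swarm manager node:

import docker

# List all swarm services and the states of their tasks. A service whose container
# has finished reports a task state such as 'complete', even while the corresponding
# Airflow TaskInstance is still shown as running.
client = docker.from_env()
for service in client.services.list():
    states = [task["Status"]["State"] for task in service.tasks()]
    print(service.name, states)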

However, when using enable_logging=False AND auto_remove=False, containers are recognized as finished and the tasks correctly end up in the success state. When using enable_logging=False and auto_remove=True, I get the following error message:

{taskinstance.py:1396} ERROR - 404 Client Error: Not Found ("service 936om1s4zso10ye5ferhvwnxn not found")
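
The 404 is consistent with the operator querying the Swarm API for a service that auto_remove has already deleted. With the docker Python SDK, the same lookup raises NotFound (a sketch; the service ID is the one from the error above):

import docker
from docker.errors import NotFound

client = docker.from_env()
try:
    client.services.get("936om1s4zso10ye5ferhvwnxn")  # service ID taken from the error message
except NotFound as err:
    # docker-py wraps the same "404 Client Error ... service ... not found" response
    print(err)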

What you expected to happen:

When I run a DAG with DockerSwarmOperators in it, I expect the docker containers to be distributed to the docker swarm and the container logs and states to be tracked correctly by the DockerSwarmOperator. That is, with the enable_logging=True option I would expect the TaskInstance’s log to contain the logging output of the docker container/service. Furthermore, with the auto_remove=True option I would expect the docker services to be removed after the TaskInstance has finished successfully.

It looks like something is broken with both the enable_logging=True and the auto_remove=True options.

How to reproduce it:

Dockerfile

FROM apache/airflow:2.0.0-python3.8

ARG DOCKER_GROUP_ID

USER root

RUN groupadd --gid $DOCKER_GROUP_ID docker \
    && usermod -aG docker airflow

USER airflow

The airflow user needs to be in the docker group to have access to the Docker daemon.

Build the Dockerfile:

docker build --build-arg DOCKER_GROUP_ID=$(getent group docker | awk -F: '{print $3}') -t docker-swarm-bug .
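
Since the airflow user relies on the docker group membership set up in the Dockerfile, a quick sanity check (not part of the original report) is to ping the daemon through the mounted socket from inside one of the Airflow containers, e.g. with the docker Python SDK:

import docker

# Requires /var/run/docker.sock to be mounted into the container (see docker-stack.yml below)
# and the airflow user to be a member of the docker group.
client = docker.DockerClient(base_url="unix://var/run/docker.sock")
print(client.ping())  # True if the daemon is reachable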

docker-stack.yml

version: "3.2"
networks:
  airflow:

services:
  postgres:
    image: postgres:13.1
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_DB=airflow
      - POSTGRES_PASSWORD=airflow
      - PGDATA=/var/lib/postgresql/data/pgdata
    ports:
      - 5432:5432
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./database/data:/var/lib/postgresql/data/pgdata
      - ./database/logs:/var/lib/postgresql/data/log
    command: >
      postgres
        -c listen_addresses=*
        -c logging_collector=on
        -c log_destination=stderr
        -c max_connections=200
    networks:
      - airflow
  redis:
    image: redis:5.0.5
    environment:
      REDIS_HOST: redis
      REDIS_PORT: 6379
    ports:
      - 6379:6379
    networks:
      - airflow
  webserver:
    env_file:
      - .env
    image: docker-swarm-bug:latest
    ports:
      - 8080:8080
    volumes:
      - ./airflow_files/dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./files:/opt/airflow/files
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      restart_policy:
        condition: on-failure
        delay: 8s
        max_attempts: 3
    depends_on:
      - postgres
      - redis
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /opt/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
    networks:
      - airflow
  flower:
    image: docker-swarm-bug:latest
    env_file:
      - .env
    ports:
      - 5555:5555
    depends_on:
      - redis
    deploy:
      restart_policy:
        condition: on-failure
        delay: 8s
        max_attempts: 3
    volumes:
      - ./logs:/opt/airflow/logs
    command: celery flower
    networks:
      - airflow
  scheduler:
    image: docker-swarm-bug:latest
    env_file:
      - .env
    volumes:
      - ./airflow_files/dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./files:/opt/airflow/files
      - /var/run/docker.sock:/var/run/docker.sock
    command: scheduler
    deploy:
      restart_policy:
        condition: on-failure
        delay: 8s
        max_attempts: 3
    networks:
      - airflow
  worker:
    image: docker-swarm-bug:latest
    env_file:
      - .env
    volumes:
      - ./airflow_files/dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./files:/opt/airflow/files
      - /var/run/docker.sock:/var/run/docker.sock
    command: celery worker
    depends_on:
      - scheduler
    deploy:
      restart_policy:
        condition: on-failure
        delay: 8s
        max_attempts: 3
    networks:
      - airflow
  initdb:
    image: docker-swarm-bug:latest
    env_file:
      - .env
    volumes:
      - ./airflow_files/dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./files:/opt/airflow/files
      - /var/run/docker.sock:/var/run/docker.sock
    entrypoint: /bin/bash
    deploy:
      restart_policy:
        condition: on-failure
        delay: 8s
        max_attempts: 5
    command: -c "airflow db init && airflow users create --firstname admin --lastname admin --email admin --password admin --username admin --role Admin"
    depends_on:
      - redis
      - postgres
    networks:
      - airflow
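
The stack mounts /var/run/docker.sock into the webserver, scheduler, and worker services so that the DockerSwarmOperator running on a worker can talk to the swarm. One way to verify this setup (not part of the original report) is to confirm from inside the worker container that the socket points to an active swarm manager:

import docker

# Run inside the worker container; the DockerSwarmOperator needs a manager node
# to create swarm services through the mounted socket.
client = docker.DockerClient(base_url="unix://var/run/docker.sock")
swarm_info = client.info()["Swarm"]
print(swarm_info["LocalNodeState"])        # expected: "active"
print(swarm_info.get("ControlAvailable"))  # expected: True on a manager node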

docker_swarm_bug.py

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.providers.docker.operators.docker_swarm import DockerSwarmOperator
# you can also try DockerSwarmOperator from contrib module, shouldn't make a difference
# from airflow.contrib.operators.docker_swarm_operator import DockerSwarmOperator

default_args = {
    "owner": "airflow",
    "start_date": "2021-01-14"
}

with DAG(
    "docker_swarm_bug", default_args=default_args, schedule_interval="@once"
) as dag:
    start_op = BashOperator(
        task_id="start_op", bash_command="echo start testing multiple dockers",
    )

    docker_swarm = list()
    for i in range(16):
        docker_swarm.append(
            DockerSwarmOperator(
                task_id=f"docker_swarm_{i}",
                image="hello-world:latest",
                force_pull=True,
                auto_remove=True,
                api_version="auto",
                docker_url="unix://var/run/docker.sock",
                network_mode="bridge",
                enable_logging=False,  # set to True to reproduce the hanging tasks; False together with auto_remove=True reproduces the 404 error described above
            )
        )

    finish_op = BashOperator(
        task_id="finish_op", bash_command="echo finish testing multiple dockers",
    )

    start_op >> docker_swarm >> finish_op

Create directories, copy the DAG, and set permissions:

mkdir -p airflow_files/dags
cp docker_swarm_bug.py airflow_files/dags/
mkdir logs
mkdir files
sudo chown -R 50000 airflow_files logs files

UID 50000 is the ID of the airflow user inside the docker images.

Deploy docker-stack.yml:

docker stack deploy --compose-file docker-stack.yml airflow

Trigger the docker_swarm_bug DAG in the UI.

Anything else we need to know:

The problem occurs with the option enable_logging=True.
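
Given that enable_logging=False combined with auto_remove=False is the only combination reported above to work, a temporary workaround (a sketch based on this report, not an official fix) is to run the operator with exactly those options and clean up the finished swarm services manually:

from airflow.providers.docker.operators.docker_swarm import DockerSwarmOperator

# Workaround sketch for the DAG above: disable both logging and auto-removal so the
# task is marked as finished; the resulting swarm services must then be removed by hand.
docker_swarm_workaround = DockerSwarmOperator(
    task_id="docker_swarm_workaround",
    image="hello-world:latest",
    api_version="auto",
    docker_url="unix://var/run/docker.sock",
    enable_logging=False,
    auto_remove=False,
)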

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (2 by maintainers)

Top GitHub Comments

1 reaction
alexcolpitts96 commented, Jan 12, 2022

@eladkal Sorry to bump an old issue, but it seems to persist with version release:2.2.3+06c82e17e9d7ff1bf261357e84c6013ccdb3c241

Containers are spawned, complete successfully, and are removed, but Airflow does not mark them as completed if enable_logging=True.

0 reactions
potiuk commented, Jun 21, 2022

@eladkal Sorry to bump an old issue, but it seems to persist with version release:2.2.3+06c82e17e9d7ff1bf261357e84c6013ccdb3c241

Indeed. You should not do it.

Please @alexcolpitts96 @FriedrichSal open new issues with a detailed description of your circumstances, logs, and reproduction cases. Commenting on old, closed issues (and especially “I have the same issue”) adds precisely 0 value without logs and details. Please watch my talk from the Summit to understand why: https://www.youtube.com/watch?v=G6VjYvKr2wQ&list=PLGudixcDaxY2LxjeHpZRtzq7miykjjFOn&index=54
