
As the title says, I want to set up Airflow so that it runs on a cluster (1 master, 2 nodes) using Docker Swarm.

Current setup:

Right now I have an Airflow setup that uses the CeleryExecutor, running on a single EC2 instance.
I have a Dockerfile that pulls Airflow's image and runs pip install -r requirements.txt.
From this Dockerfile I build a local image, and that image is used in the docker-compose.yml that spins up the different services Airflow needs (webserver, scheduler, redis, flower and some workers; the metadata DB is Postgres on a separate RDS instance).
The compose file is deployed in Docker Swarm mode, i.e. docker stack deploy -c docker-compose.yml airflow_stack.

Required Setup:

I want to scale the current setup to 3 EC2 instances (1 master, 2 nodes), where the master runs the webserver, scheduler, redis and flower, and the workers run on the nodes.
After searching the web and the docs, there are a few things that are still not clear to me:

  1. From what I understand, in order for the nodes to run the workers, the local image that I'm building from the Dockerfile needs to be pushed to some registry (if it's really needed, I would use AWS ECR) so the Airflow workers can create containers from that image. Is that correct?
  2. Syncing volumes and env files: right now I'm mounting the volumes and setting the envs in the docker-compose file. Would these mounts and envs be synced to the nodes (and to the Airflow worker containers)? If not, how can I make sure that everything stays in sync, since Airflow requires that all components (apart from redis) have all the dependencies, etc.?
  3. One of the envs that needs to be set when using the CeleryExecutor is the broker_url. How can I make sure that the nodes recognize the redis broker that is on the master?

I'm sure there are a few more things I've forgotten, but what I wrote is a good start.
Any help or recommendation would be greatly appreciated.

Thanks!

Dockerfile:

FROM apache/airflow:2.1.3-python3.9

# Build tools are needed to compile some of the Python dependencies
USER root
RUN apt-get update \
    && apt-get install -y build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies as the airflow user
USER airflow
COPY requirements.txt requirements.txt
COPY requirements.airflow.txt requirements.airflow.txt

RUN pip install --upgrade pip wheel \
    && pip install -r requirements.airflow.txt \
    && pip install -r requirements.txt

# Worker log-server port (8793) plus the extra ports from the original setup
EXPOSE 8793 8786 8787

docker-compose.yml:

version: '3.8'
x-airflow-celery: &airflow-celery
  image: local_image:latest
  volumes:
    - some_volume
  env_file:
    - some_env_file

services:
  webserver:
    <<: *airflow-celery
    command: airflow webserver
    restart: always
    ports:
      - 80:8080
    healthcheck:
      test: [ "CMD-SHELL", "[ -f /opt/airflow/airflow-webserver.pid ]" ]
      interval: 10s
      timeout: 30s
      retries: 3

  scheduler:
    <<: *airflow-celery
    command: airflow scheduler
    restart: always
    deploy:
      replicas: 2

  redis:
    image: redis:6.0
    command: redis-server --include /redis.conf
    healthcheck:
      test: [ "CMD", "redis-cli", "ping" ]
      interval: 30s
      timeout: 10s
      retries: 5
    ports:
      - 6379:6379
    environment:
      - REDIS_PORT=6379

  worker:
    <<: *airflow-celery
    command: airflow celery worker
    deploy:
      replicas: 16

  flower:
    <<: *airflow-celery
    command: airflow celery flower
    ports:
      - 5555:5555


Answers


  1. Sounds like you are heading in the right direction (with one general comment at the end though).

    1. Yes, you need to push the image to a container registry and refer to it via a public (or private, if you authenticate) tag. The tag in this case is usually registry/name:tag. For example, you can see one of the CI images of Airflow here: https://github.com/apache/airflow/pkgs/container/airflow%2Fmain%2Fci%2Fpython3.9 – the purpose is a bit different (we use it for our CI builds) but the mechanism is the same: you build it locally, tag it as "registry/image:tag" (docker build . --tag registry/image:tag) and run docker push registry/image:tag.
      Then, whenever you refer to it from your docker compose via registry/image:tag, docker compose/swarm will pull the right image. Just make sure you use a unique tag for each image you build, so you always know which image you pushed (and can tell future builds apart). For example:
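
      A sketch of that flow with AWS ECR (the account ID, region and repository name below are placeholders, not values from your setup):
      # One-time: create the repository, e.g. aws ecr create-repository --repository-name airflow-custom
      REGISTRY=123456789012.dkr.ecr.eu-west-1.amazonaws.com
      IMAGE="$REGISTRY/airflow-custom:2.1.3-build1"

      # Authenticate Docker against ECR (AWS CLI v2)
      aws ecr get-login-password --region eu-west-1 \
        | docker login --username AWS --password-stdin "$REGISTRY"

      # Build, tag and push the image the swarm nodes will pull
      docker build . --tag "$IMAGE"
      docker push "$IMAGE"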

    2. Env files should be fine and they will be distributed across the instances, but locally mounted volumes will not. You either need some shared filesystem (like NFS, or maybe EFS if you use AWS) where the DAGs are stored, or some other synchronization method to distribute the DAGs. One option is git-sync, which has very nice properties, especially if you use Git to store the DAG files; another is baking the DAGs into the image (which requires re-pushing the image whenever they change). You can see the different options explained in our Helm Chart docs: https://airflow.apache.org/docs/helm-chart/stable/manage-dags-files.html (a rough git-based sync sketch follows below).
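
      Outside Kubernetes you can approximate git-sync with a small loop that keeps pulling the DAG repository into the shared folder. A minimal sketch, assuming a hypothetical repo URL and a shared path of /nfs/airflow/dags:
      #!/usr/bin/env bash
      # Hypothetical repo and target path -- adjust to your environment
      REPO=https://github.com/your-org/airflow-dags.git
      TARGET=/nfs/airflow/dags

      # Clone on first run, then keep fast-forwarding
      [ -d "$TARGET/.git" ] || git clone "$REPO" "$TARGET"
      while true; do
        git -C "$TARGET" pull --ff-only
        sleep 60
      done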

    3. You cannot use localhost; you need to point it at a specific host and make sure the broker URL is reachable from all instances. This can be done by assigning a specific IP address/DNS name to your ‘broker’ instance and opening up the right ports in the firewalls (make sure you control where those ports can be reached from), and maybe even employing some load balancing.
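
      That said, if you keep Redis as a service inside the same swarm stack (as in your compose file), services attached to the stack's overlay network can reach it by service name, so the entries in your shared env file could look roughly like this (the result backend and metadata DB lines are assumptions based on your RDS Postgres, with placeholder credentials):
      # "redis" resolves to the redis service over the stack's overlay network
      AIRFLOW__CELERY__BROKER_URL=redis://redis:6379/0
      # Placeholders for the RDS metadata DB endpoint/credentials
      AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:<password>@<rds-endpoint>:5432/airflow
      AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:<password>@<rds-endpoint>:5432/airflow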

    I do not know Docker Swarm well enough to say how difficult or easy it is to set all of this up, but honestly, it seems like quite a lot of work to do manually.

    I would strongly, really strongly encourage you to use Kubernetes and the Helm Chart which the Airflow community develops: https://airflow.apache.org/docs/helm-chart/stable/index.html . A lot of the issues and necessary configuration are solved either by Kubernetes itself (scaling, shared filesystems/PVs, networking and connectivity, resource management, etc.) or by our Helm Chart (git-sync sidecar containers, broker configuration, etc.).

  2. I run Airflow CeleryExecutor on Docker Swarm.

    So assuming that you have Docker Swarm set up on your nodes, here are a few things you can do:

    1. Map shared volumes to NFS folders like this (same for plugins and logs, or anything else you need to share)
    volumes:
      dags:
        driver_opts:
          type: "none"
          o: "bind"
          device: "/nfs/airflow/dags"
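
    Note that this driver_opts/bind combination assumes the NFS export is already mounted at /nfs/airflow on every node. On each host that might look roughly like this (the NFS server address and export path are placeholders):
    # Run once per swarm node; hypothetical NFS server and export path
    sudo mkdir -p /nfs/airflow
    sudo mount -t nfs nfs-server.internal:/export/airflow /nfs/airflow
    # add a matching /etc/fstab entry so the mount survives reboots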
    
    2. I personally use Docker secrets to handle my webserver password, database password, etc. (similarly, I use Docker configs to pass in my Celery and webserver config):
    secrets:
      postgresql_password:
        external: true
      fernet_key:
        external: true
      webserver_password:
        external: true
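
    These secrets are marked external, so they have to exist before the stack is deployed. Creating them on a manager node could look like this (the values are placeholders; the Fernet key line assumes the cryptography package is installed on that host):
    # Run on a swarm manager node; the names must match the stack file
    echo -n 'db-password-here' | docker secret create postgresql_password -
    echo -n 'ui-password-here' | docker secret create webserver_password -
    python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())" \
      | docker secret create fernet_key -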
    

    To have Airflow read the secrets, I added a simple bash script that gets sourced from the entrypoint.sh script. That way I do not need to hardcode any passwords in my stack file; instead, whenever a variable's value contains the DOCKER-SECRET marker, the script looks the real value up in /run/secrets/ (I think I used this as an example when setting it up: https://gist.github.com/bvis/b78c1e0841cfd2437f03e20c1ee059fe).

    In my entrypoint script I add the script that expands Docker Secrets:

    source /env_secrets_expand.sh
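
    For reference, the logic in that script boils down to something like the sketch below (a simplified version of the pattern from the gist linked above, not my exact file):
    # For every env var whose value looks like "DOCKER-SECRET->name",
    # replace it with the contents of /run/secrets/name.
    env_secrets_expand() {
      for pair in $(env | grep 'DOCKER-SECRET->'); do
        var_name="${pair%%=*}"
        secret_name="${pair#*DOCKER-SECRET->}"
        if [ -f "/run/secrets/$secret_name" ]; then
          export "$var_name"="$(cat "/run/secrets/$secret_name")"
        fi
      done
    }
    env_secrets_expand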

    x-airflow-variables: &airflow-variables
        AIRFLOW__CORE__EXECUTOR: CeleryExecutor
        ...
        AIRFLOW__WEBSERVER__SECRET_KEY: DOCKER-SECRET->webserver_secret_key
    
    

    The postgres image is set up the same way, without the password in a plain environment variable:

    services:
      postgres:
        image: postgres:11.5
        secrets:
          - source: postgresql_password
            target: /run/secrets/postgresql_password
        environment:
          - POSTGRES_USER=airflow
          - POSTGRES_DB=airflow
          - POSTGRES_PASSWORD_FILE=/run/secrets/postgresql_password
    
    3. You can obviously use Swarm labels or hostnames to determine which nodes a certain service should run on:
      scheduler:
        <<: *airflow-common
        environment: *airflow-variables
        command: scheduler
        deploy:
          replicas: 2
          mode: replicated
          placement:
            constraints:
              - node.labels.type == worker
          restart_policy:
            condition: on-failure
            delay: 5s
            max_attempts: 3
            window: 120s
        logging:
          driver: fluentd
          options:
            tag: docker.airflow.scheduler
            fluentd-async-connect: "true"
    

    And for the Celery workers, I have my default queue and then a special queue which is pinned to a single node for historical reasons (clients have whitelisted that specific IP address, so I need to ensure those tasks only run on that node). My entrypoint runs exec airflow celery "$@" -q "$QUEUE_NAME" (a simplified sketch of that entrypoint follows after the stack file below), and my stack file looks like this:

      worker_default:
        <<: *airflow-common
        environment:
          <<: *airflow-variables
          QUEUE_NAME: default
        command: worker
        deploy:
          replicas: 3
          mode: replicated
          placement:
            constraints:
              - node.labels.type == worker
    
      worker_nodename:
        <<: *airflow-common
        environment:
          <<: *airflow-variables
          QUEUE_NAME: nodename
        command: worker
        deploy:
          replicas: 1
          mode: replicated
          placement:
            constraints:
              - node.hostname == nodename
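
    As mentioned above, a simplified sketch of such an entrypoint; the real one also handles the non-worker commands, and the queue name falls back to "default" here:
    #!/usr/bin/env bash
    set -e
    # Expand any DOCKER-SECRET-> values before starting anything
    source /env_secrets_expand.sh

    # "command:" from the stack file arrives here as "$@" (e.g. "worker", "scheduler")
    if [ "$1" = "worker" ]; then
      exec airflow celery worker -q "${QUEUE_NAME:-default}"
    else
      exec airflow "$@"
    fi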
    
    

    I use GitLab CI/CD to deploy my DAGs/plugins whenever I merge to main, and to build the images and redeploy the services if I update the Dockerfile or certain other files. I have been running Airflow this way since 2017 or 2018, but I do plan on switching to Kubernetes eventually, since that seems like the more standard approach.
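
    In case it helps, the deploy side of such a pipeline typically boils down to something along these lines (image name, stack name and paths are placeholders, not my actual jobs):
    # Build and push the image; $CI_COMMIT_SHORT_SHA is provided by GitLab CI
    docker build . --tag registry.example.com/airflow-custom:$CI_COMMIT_SHORT_SHA
    docker push registry.example.com/airflow-custom:$CI_COMMIT_SHORT_SHA

    # Sync DAGs/plugins to the shared NFS folder the services mount
    rsync -av --delete dags/ /nfs/airflow/dags/
    rsync -av --delete plugins/ /nfs/airflow/plugins/

    # Re-deploy the stack; assumes the stack file references ${TAG} in the image name
    TAG=$CI_COMMIT_SHORT_SHA docker stack deploy -c docker-stack.yml airflow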
