As the title says, I want to set up Airflow to run on a cluster (1 master, 2 nodes) using Docker Swarm.
Current setup:
Right now I have an Airflow setup that uses the CeleryExecutor and runs on a single EC2 instance.
I have a Dockerfile that pulls Airflow’s image and runs pip install -r requirements.txt.
From this Dockerfile I’m creating a local image, and this image is used in the docker-compose.yml that spins up the different services Airflow needs (webserver, scheduler, redis, flower and some workers; the metadata DB is Postgres on a separate RDS instance).
The docker-compose file is used in Docker Swarm mode, i.e. docker stack deploy . airflow_stack
Required Setup:
I want to scale the current setup to 3 EC2 instances (1 master, 2 nodes), where the master runs the webserver, scheduler, redis and flower, and the workers run on the nodes.
After searching the web and the docs, there are a few things that are still not clear to me:
- From what I understand, in order for the nodes to run the workers, the local image that I’m building from the Dockerfile needs to be pushed to some repository (if it’s really needed, I would use AWS ECR) so that the Airflow workers can create their containers from that image. Is that correct?
- Syncing volumes and env files: right now, I’m mounting the volumes and inserting the envs in the docker-compose file. Would these mounts and envs be synced to the nodes (and the Airflow worker containers)? If not, how can I make sure that everything is in sync, given that Airflow requires all the components (apart from redis) to have all the dependencies, etc.?
- One of the envs that needs to be set when using the CeleryExecutor is broker_url. How can I make sure that the nodes can reach the redis broker that is on the master?
I’m sure there are a few more things that I forgot, but what I wrote is a good start.
Any help or recommendation would be greatly appreciated.
Thanks!
Dockerfile:
FROM apache/airflow:2.1.3-python3.9
USER root
RUN apt update;
RUN apt -y install build-essential;
USER airflow
COPY requirements.txt requirements.txt
COPY requirements.airflow.txt requirements.airflow.txt
RUN pip install --upgrade pip;
RUN pip install --upgrade wheel;
RUN pip install -r requirements.airflow.txt
RUN pip install -r requirements.txt
EXPOSE 8793 8786 8787
docker-compose.yml:
version: '3.8'

x-airflow-celery: &airflow-celery
  image: local_image:latest
  volumes:
    - some_volume
  env_file:
    - some_env_file

services:
  webserver:
    <<: *airflow-celery
    command: airflow webserver
    restart: always
    ports:
      - 80:8080
    healthcheck:
      test: [ "CMD-SHELL", "[ -f /opt/airflow/airflow-webserver.pid ]" ]
      interval: 10s
      timeout: 30s
      retries: 3

  scheduler:
    <<: *airflow-celery
    command: airflow scheduler
    restart: always
    deploy:
      replicas: 2

  redis:
    image: redis:6.0
    command: redis-server --include /redis.conf
    healthcheck:
      test: [ "CMD", "redis-cli", "ping" ]
      interval: 30s
      timeout: 10s
      retries: 5
    ports:
      - 6379:6379
    environment:
      - REDIS_PORT=6379

  worker:
    <<: *airflow-celery
    command: airflow celery worker
    deploy:
      replicas: 16

  flower:
    <<: *airflow-celery
    command: airflow celery flower
    ports:
      - 5555:5555
2 Answers
Sounds like you are heading in the right direction (with one general comment at the end though).
Yes, you need to push the image to a container registry and refer to it via a public (or private, if you authenticate) tag. The tag in this case is usually of the form registry/name:tag. For example, you can see one of the CI images of Airflow here: https://github.com/apache/airflow/pkgs/container/airflow%2Fmain%2Fci%2Fpython3.9 (the purpose is a bit different, since we use it for our CI builds, but the mechanism is the same). You build it locally and tag it with docker build . --tag registry/image:tag, and then run docker push registry/image:tag. Then, whenever you refer to it from your docker compose file via registry/image:tag, docker compose/swarm will pull the right image. Just make sure you use unique tags when you build your images, so that you know which image you pushed (and to account for future images).
Env files should be fine and they will be distributed across the instances, but locally mounted volumes will not. You either need to have some shared filesystem (like NFS, maybe EFS if you use AWS) where the DAGs are stored, or use some other synchronization method to distribute the DAGs. It can be, for example, git-sync, which has very nice properties especially if you use Git to store the DAG files, or baking the DAGs into the image (which requires re-pushing the image whenever they change). You can see the different options explained in our Helm Chart docs: https://airflow.apache.org/docs/helm-chart/stable/manage-dags-files.html
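As a rough illustration of the two points above, here is a minimal sketch of how the shared part of the stack file might change. The repository name airflow-custom, the tag, the placeholders in angle brackets, and the choice of an NFS-backed volume are all assumptions made for the example, not something taken from the question.

```yaml
# Sketch only: <account-id>, <region>, <nfs-server-address> and the tag are placeholders.
version: '3.8'

x-airflow-celery: &airflow-celery
  # Full registry reference instead of local_image:latest, with a unique tag per build.
  image: "<account-id>.dkr.ecr.<region>.amazonaws.com/airflow-custom:2.1.3-build-42"
  volumes:
    # Named volume instead of a host path, so every node mounts the same DAGs.
    - dags:/opt/airflow/dags
  env_file:
    - some_env_file

volumes:
  dags:
    # The built-in local driver can mount an NFS export (EFS also speaks NFS).
    driver: local
    driver_opts:
      type: nfs
      o: "addr=<nfs-server-address>,nfsvers=4.1,rw"
      device: ":/"
```

With a private registry, deploying with docker stack deploy --with-registry-auth forwards the registry credentials to the nodes so that they can pull the image.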
You cannot use localhost; you need to set it to a specific host and make sure your broker URL is reachable from all instances. This can be done by assigning a specific IP address/DNS name to your ‘broker’ instance and opening up the right ports in firewalls (make sure you control where those ports can be reached from), and maybe even employing some load balancing.
I do not know Docker Swarm well enough to say how difficult or easy it is to set it all up, but honestly, that seems like a lot of work to do manually.
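For the broker point specifically, one possible sketch in a Swarm stack file is shown below. It assumes all services are deployed in the same stack, so they share the stack's default overlay network and can reach each other by service name. The placement constraints are an assumed way to get the "master runs redis, nodes run workers" layout from the question.

```yaml
# Sketch only: a fragment to merge into the existing service definitions.
services:
  redis:
    image: redis:6.0
    deploy:
      placement:
        constraints:
          - node.role == manager   # keep the broker on the master

  worker:
    environment:
      # "redis" is the service name; Swarm's internal DNS resolves it on the
      # stack's overlay network, so no fixed IP address is needed here.
      # The scheduler, webserver and flower need the same value
      # (for example via the shared env file).
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/0
    deploy:
      replicas: 16
      placement:
        constraints:
          - node.role == worker    # run workers only on the two nodes
```

If every client of redis lives inside the stack, the published 6379 port is not needed for this traffic.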
I would strongly, really strongly encourage you to use Kubernetes and the Helm Chart which the Airflow community develops: https://airflow.apache.org/docs/helm-chart/stable/index.html . A lot of the issues and necessary configuration are either solved by K8S itself (scaling, shared filesystems via PVs, networking and connectivity, resource management, etc.) or by our Helm Chart (git-sync sidecar containers, broker configuration, etc.).
I run Airflow CeleryExecutor on Docker Swarm.
So assuming that you have Docker Swarm set up on your nodes, here are a few things you can do:
To have Airflow read the secrets, I added a simple bash script that gets added to the entrypoint.sh script. So in my stack file I do not need to hardcode any passwords, but if the DOCKER-SECRET string is available, then it will look in /run/secrets/ (I think I used this as an example when setting it up: https://gist.github.com/bvis/b78c1e0841cfd2437f03e20c1ee059fe).
In my entrypoint script I add the script that expands Docker Secrets:
source /env_secrets_expand.sh
This is how the postgres image is set up as well, without environment variables:
And for Celery workers, I have my default queue and then a special queue which is pinned to a single node for historical reasons (clients have whitelisted this specific IP address, so I need to ensure that those tasks only run on that node). So my entrypoint runs exec airflow celery "$@" -q "$QUEUE_NAME", and my stack file is like this:
I use GitLab CI/CD to deploy my DAGs/plugins whenever I merge to main, and to build the images and deploy the services if I update the Dockerfile or certain other files. I have been running Airflow this way for a few years now (since 2017 or 2018), but I do plan on switching to Kubernetes eventually since that seems like the more standard approach.
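The secrets and pinned-queue setup described in this answer could look roughly like the fragment below. This is a hypothetical sketch, not the answer author's actual stack file. The service name worker_pinned, the queue name, the node hostname and the secret name are invented for illustration, and the {{DOCKER-SECRET:...}} placeholder syntax is assumed to follow the convention of the linked gist.

```yaml
# Hypothetical stack-file fragment; all names below are placeholders.
version: '3.8'

services:
  worker_pinned:
    image: registry/image:tag
    environment:
      # Read by the entrypoint: exec airflow celery "$@" -q "$QUEUE_NAME"
      QUEUE_NAME: whitelisted_ip
      # Placeholder that env_secrets_expand.sh is assumed to replace with the
      # contents of /run/secrets/sql_alchemy_conn at container start.
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: "{{DOCKER-SECRET:sql_alchemy_conn}}"
    secrets:
      - sql_alchemy_conn
    command: worker
    deploy:
      replicas: 1
      placement:
        constraints:
          # Pin this queue's worker to the node with the whitelisted IP.
          - node.hostname == pinned-node

secrets:
  sql_alchemy_conn:
    external: true   # created beforehand with docker secret create
```

Tasks then reach the pinned node by setting queue="whitelisted_ip" on the operator, while everything else stays on the default queue.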