What I have:
I run Celery with RabbitMQ as the broker and Redis as the result backend. I have an app that sends tasks and workers that process them.
I deployed this as follows:
- The app, Redis, RabbitMQ and a worker (let’s call it ‘local_worker’) run on an Azure VM with Docker Compose, so I use the Docker images of RabbitMQ and Redis (6.2.5). The RabbitMQ and Redis ports are open on the VM, and those containers are configured with a username and password.
- I add workers using Azure Container Instances that connect to the Redis and RabbitMQ running on the VM.
First, if you have any recommendations on this architecture, I would be glad to hear them.
The problem:
Everything works well: the tasks are dispatched to the different workers, which send back the results, etc.
When a task is sent after 30 minutes with no task running, I observe a Redis latency of 2 minutes, but only when the task is not sent to ‘local_worker’.
- I know this must come from Redis, because I can see the task’s logs in the worker container instance immediately after sending the task.
- I monitor this architecture with Flower and Grafana (via the Celery Prometheus exporter), so I can track task latency. In Flower the latent task stays in the ‘processing’ status.
- There are exactly 120 extra seconds on a task that is the first one after an idle interval and that is not processed by ‘local_worker’.
- This does not happen when the task is processed by ‘local_worker’, which runs on the same VM as Redis.
It is as if Redis or the VM were sleeping for 2 minutes before sending back the result. Since it is exactly 120 seconds (2 minutes), I suspect it is something deterministic, wanted by Redis, Celery or Azure.
I don’t use a Redis conf file, only default settings (except for the password) to run the Redis server.
Thanks for your help and feedback on my architecture and problems.
Here is a screenshot of what I see in Flower. The three tasks are the same (removing a directory).
The first and third tasks were processed by the local worker; the second one was processed by an external worker. In the external worker’s logs, I put a print line just before returning the result, and this line was printed at 14:14:23. So there were 120 seconds between this print and the official end of the task.
EDIT:
I found that the default value for redis_socket_timeout is 120 seconds.
I removed the line redis_retry_on_timeout = True and added the line redis_socket_keepalive = True in my Celery config file. Now the task fails with redis.exceptions.TimeoutError: Timeout reading from socket.
I don’t know why the socket times out even though the result is ready. Is it a problem with the network of my container instance?
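For reference, these are the Redis result-backend settings involved, as they would appear in my Celery config module shown below (the values here are only illustrative, not something I have verified as a fix):
# Redis result-backend settings in the Celery config (illustrative values, not a verified fix)
redis_socket_timeout = 30      # default is 120 seconds, which matches the delay I observe
redis_socket_keepalive = True  # keep the backend connection alive between results
redis_retry_on_timeout = True  # retry the blocking read instead of raising TimeoutError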
Here is my docker-compose:
version: "3.5"
services:
rabbitmq:
image: rabbitmq:3.8-management
restart: always
ports:
- 5672:5672
labels:
- traefik.enable=true
- traefik.http.services.rabbitmq-ui.loadbalancer.server.port=15672
- traefik.http.routers.rabbitmq-ui-http.entrypoints=http
- traefik.http.routers.rabbitmq-ui-http.rule=(Host(`rabbitmq.${HOSTNAME?Variable not set}.example.app`))
- traefik.docker.network=traefik-public
- traefik.http.routers.rabbitmq-ui-https.entrypoints=https
- traefik.http.routers.rabbitmq-ui-https.rule=Host(`rabbitmq.${HOSTNAME?Variable not set}.example.app`)
- traefik.http.routers.rabbitmq-ui-https.tls=true
- traefik.http.routers.rabbitmq-ui-https.tls.certresolver=le
- traefik.http.routers.rabbitmq-ui-http.middlewares=https-redirect
env_file:
- .env
environment:
- RABBITMQ_DEFAULT_USER=${RABBITMQ_DEFAULT_USER}
- RABBITMQ_DEFAULT_PASS=${RABBITMQ_DEFAULT_PASS}
networks:
- traefik-public
redis:
image: redis:6.2.5
restart: always
command: ["redis-server", "--requirepass", "${RABBITMQ_DEFAULT_PASS:-password}"]
ports:
- 6379:6379
networks:
- traefik-public
flower:
image: mher/flower:0.9.5
restart: always
labels:
- traefik.enable=true
- traefik.http.services.flower-ui.loadbalancer.server.port=5555
- traefik.http.routers.flower-ui-http.entrypoints=http
- traefik.http.routers.flower-ui-http.rule=Host(`flower.${HOSTNAME?Variable not set}.example.app`)
- traefik.docker.network=traefik-public
- traefik.http.routers.flower-ui-https.entrypoints=https
- traefik.http.routers.flower-ui-https.rule=Host(`flower.${HOSTNAME?Variable not set}.example.app`)
- traefik.http.routers.flower-ui-https.tls=true
- traefik.http.routers.flower-ui-https.tls.certresolver=le
- traefik.http.routers.flower-ui-http.middlewares=https-redirect
- traefik.http.routers.flower-ui-https.middlewares=traefik-admin-auth
env_file:
- .env
command:
- "--broker=amqp://${RABBITMQ_DEFAULT_USER:-guest}:${RABBITMQ_DEFAULT_PASS:-guest}@rabbitmq:5672//"
depends_on:
- rabbitmq
- redis
networks:
- traefik-public
local_worker:
build:
context: ..
dockerfile: ./setup/devops/docker/app.dockerfile
image: swtools:app
restart: always
volumes:
- ${SWTOOLSWORKINGDIR:-/tmp}:${SWTOOLSWORKINGDIR:-/tmp}
command: ["celery", "--app=app.worker.celery_app:celery_app", "worker", "-n", "local_worker@%h"]
env_file:
- .env
environment:
- RABBITMQ_DEFAULT_USER=${RABBITMQ_DEFAULT_USER}
- RABBITMQ_DEFAULT_PASS=${RABBITMQ_DEFAULT_PASS}
- RABBITMQ_HOST=rabbitmq
- REDIS_HOST=${HOSTNAME?Variable not set}
depends_on:
- rabbitmq
- redis
networks:
- traefik-public
dashboard_app:
image: swtools:app
restart: always
labels:
- traefik.enable=true
- traefik.http.services.dash-app.loadbalancer.server.port=${DASH_PORT-8080}
- traefik.http.routers.dash-app-http.entrypoints=http
- traefik.http.routers.dash-app-http.rule=Host(`dashboard.${HOSTNAME?Variable not set}.example.app`)
- traefik.docker.network=traefik-public
- traefik.http.routers.dash-app-https.entrypoints=https
- traefik.http.routers.dash-app-https.rule=Host(`dashboard.${HOSTNAME?Variable not set}.example.app`)
- traefik.http.routers.dash-app-https.tls=true
- traefik.http.routers.dash-app-https.tls.certresolver=le
- traefik.http.routers.dash-app-http.middlewares=https-redirect
- traefik.http.middlewares.operator-auth.basicauth.users=${OPERATOR_USERNAME?Variable not set}:${HASHED_OPERATOR_PASSWORD?Variable not set}
- traefik.http.routers.dash-app-https.middlewares=operator-auth
volumes:
- ${SWTOOLSWORKINGDIR:-/tmp}:${SWTOOLSWORKINGDIR:-/tmp}
command: ['waitress-serve', '--port=${DASH_PORT:-8080}', 'app.order_dashboard:app.server']
env_file:
- .env
environment:
- RABBITMQ_DEFAULT_USER=${RABBITMQ_DEFAULT_USER}
- RABBITMQ_DEFAULT_PASS=${RABBITMQ_DEFAULT_PASS}
- RABBITMQ_HOST=rabbitmq
- REDIS_HOST=${HOSTNAME?Variable not set}
networks:
- traefik-public
depends_on:
- rabbitmq
- redis
networks:
traefik-public:
external: true
and my celery config file:
import os
import warnings
from pathlib import Path
# result backend uses Redis
result_backend_host = os.getenv('REDIS_HOST', 'localhost')
result_backend_pass = os.getenv('REDIS_PASS', 'password')
result_backend = 'redis://:{password}@{host}:6379/0'.format(password=result_backend_pass, host=result_backend_host)
# redis_retry_on_timeout = True
redis_socket_keepalive = True
# broker uses RabbitMQ
rabbitmq_user = os.getenv('RABBITMQ_DEFAULT_USER', 'guest')
rabbitmq_pass = os.getenv('RABBITMQ_DEFAULT_PASS', 'guest')
rabbitmq_host = os.getenv('RABBITMQ_HOST', 'localhost')
broker_url = 'amqp://{user}:{password}@{host}:5672//'.format(user=rabbitmq_user, password=rabbitmq_pass, host=rabbitmq_host)
include = ['app.worker.tasks', 'app.dashboard.example1', 'app.dashboard.example2']
# task events
worker_send_task_events = True
task_send_sent_event = True
All the env variables are defined and everything works well, except for my socket timeout problem! When I deploy a new worker on a container instance, I set the env variables so it connects to the RabbitMQ and Redis running in the Docker Compose stack.
Here is my celery file that defines the celery app:
from celery import Celery
from app.worker import celery_config
celery_app = Celery()
celery_app.config_from_object(celery_config)
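For completeness, results are consumed roughly like this on the app side (the task name and argument here are placeholders), which is where the 2-minute wait and the TimeoutError show up:
from app.worker.celery_app import celery_app

# placeholder task name and argument, just to show where the result backend is read
result = celery_app.send_task('app.worker.tasks.remove_dir', args=['/tmp/some_dir'])
print(result.get(timeout=300))  # blocks on the Redis result backend until the result arrives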
2 Answers
Finally, changing the result backend to rpc resolved the problem. I tried different things with Redis that didn't work. A way to dig further would be to inspect the sockets with tcpdump to see where it blocks, but I didn't try that since my problem was solved with the rpc backend.
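In the Celery config module above, that change is a one-liner (a sketch; results are then sent back over the existing RabbitMQ broker instead of Redis):
# use the RPC result backend: results travel back over the AMQP broker
result_backend = 'rpc://'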
I guess that you have some firewall between your Redis instance and your worker.
Can you log in to that SandboxHost... and ensure that you can connect to your Redis? You can do that with telnet, for example, or with redis-cli.
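If you prefer, the same check from Python with redis-py (the hostname and password here are placeholders):
import redis

# placeholders: use the VM's public hostname and the password from your .env
r = redis.Redis(host='your-vm-hostname', port=6379, password='password', socket_timeout=5)
print(r.ping())  # True means this network can reach your Redis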
EDIT:
It seems like you're missing result_backend; also make sure that your REDIS_HOST=${HOSTNAME?Variable not set} is valid…
EDIT2:
Can you add bind to your Redis command? Please be aware of its security implications!