
What I have:

I run Celery with RabbitMQ as the broker and Redis as the result backend. I have an app that sends tasks and workers that process them.

I deployed this as follows:

  • The app, Redis, RabbitMQ and a worker (let’s call it ‘local_worker’) run on an Azure VM via docker-compose, so I use the Docker images of RabbitMQ and Redis (6.2.5). The RabbitMQ and Redis ports are open on the VM, and those containers are configured with a username and password.
  • I add workers using Azure Container Instances that connect to the Redis and RabbitMQ running on the VM.

First, if you have any recommendations on this architecture, I would be glad to hear them.

The problem:

Everything works well: tasks are dispatched to the different workers, which send back the results, and so on.

When a task is sent after 30 minutes with no tasks running, and that task is not routed to ‘local_worker’, I observe a 2-minute latency on the Redis side.

  • I know this must come from Redis, because I can see the task’s logs in the worker container instance immediately after sending the task.
  • I monitor this architecture with Flower, and with Grafana via a Celery Prometheus exporter, so I can track task latency. In Flower, the latent task stays in the ‘processing’ status.
  • The delay is exactly 120 extra seconds, and it affects a task that is the first one after an idle interval and is not processed by ‘local_worker’.
  • It does not happen when the task is processed by ‘local_worker’, which runs on the same VM as Redis.

It is as if Redis or the VM were sleeping for 2 minutes before sending back the result. Since it is exactly 120 seconds (2 minutes), I suspect something deterministic, intended by Redis, Celery or Azure.
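
One way to time this from the caller side would be something like the following sketch (some_task is an illustrative stand-in for one of my real tasks in app.worker.tasks):

import time

from app.worker.tasks import some_task  # illustrative; any real task works

# Send one task after a >30 min idle period and time the round trip.
# The latent case consistently shows ~120 s more than the local-worker case.
start = time.monotonic()
result = some_task.delay()
result.get()
print(f"round trip: {time.monotonic() - start:.1f} s")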

I don’t use a Redis conf file; I run the Redis server with default settings (except for the password).

Thanks for your help and for any feedback on my architecture and problem.

Here is a screenshot of what I see in Flower: my view of the tasks, which were sent 25 minutes apart (the three tasks are identical, removing a directory).

The first and third tasks were processed by the local worker; the second one by an external worker. In the external worker I put a print line just before returning the result, and it was printed at 14:14:23. So there were 120 seconds between this print and the official end of the task.

EDIT:

I found that the default value of redis_socket_timeout is 120 seconds.

I removed the line redis_retry_on_timeout = True and added the line redis_socket_keepalive = True in my Celery config file. Now the task fails with redis.exceptions.TimeoutError: Timeout reading from socket.
I don’t know why the socket times out when the result is ready. Is it a network problem on my container instance’s side?
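
For reference, the settings involved look like this in the config-module style I use below (120 seconds is Celery’s documented default for redis_socket_timeout; the comments note what I changed):

# Celery Redis result-backend socket settings
redis_socket_timeout = 120        # default: 120 s, matching the observed stall
redis_retry_on_timeout = False    # default; I removed my earlier True override
redis_socket_keepalive = True     # default is False; I enabled it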

Here is my docker-compose:

version: "3.5"
services:

  rabbitmq:
    image: rabbitmq:3.8-management
    restart: always
    ports:
      - 5672:5672
    labels:
      - traefik.enable=true
      - traefik.http.services.rabbitmq-ui.loadbalancer.server.port=15672
      - traefik.http.routers.rabbitmq-ui-http.entrypoints=http
      - traefik.http.routers.rabbitmq-ui-http.rule=(Host(`rabbitmq.${HOSTNAME?Variable not set}.example.app`))
      - traefik.docker.network=traefik-public
      - traefik.http.routers.rabbitmq-ui-https.entrypoints=https
      - traefik.http.routers.rabbitmq-ui-https.rule=Host(`rabbitmq.${HOSTNAME?Variable not set}.example.app`)
      - traefik.http.routers.rabbitmq-ui-https.tls=true
      - traefik.http.routers.rabbitmq-ui-https.tls.certresolver=le
      - traefik.http.routers.rabbitmq-ui-http.middlewares=https-redirect
    env_file:
      - .env
    environment:
      - RABBITMQ_DEFAULT_USER=${RABBITMQ_DEFAULT_USER}
      - RABBITMQ_DEFAULT_PASS=${RABBITMQ_DEFAULT_PASS}

    networks:
      - traefik-public


  redis:
    image: redis:6.2.5
    restart: always
    command: ["redis-server", "--requirepass", "${RABBITMQ_DEFAULT_PASS:-password}"]
    ports:
      - 6379:6379
    networks:
      - traefik-public

  flower:
    image: mher/flower:0.9.5
    restart: always
    labels:
      - traefik.enable=true
      - traefik.http.services.flower-ui.loadbalancer.server.port=5555
      - traefik.http.routers.flower-ui-http.entrypoints=http
      - traefik.http.routers.flower-ui-http.rule=Host(`flower.${HOSTNAME?Variable not set}.example.app`)
      - traefik.docker.network=traefik-public
      - traefik.http.routers.flower-ui-https.entrypoints=https
      - traefik.http.routers.flower-ui-https.rule=Host(`flower.${HOSTNAME?Variable not set}.example.app`)
      - traefik.http.routers.flower-ui-https.tls=true
      - traefik.http.routers.flower-ui-https.tls.certresolver=le
      - traefik.http.routers.flower-ui-http.middlewares=https-redirect

      - traefik.http.routers.flower-ui-https.middlewares=traefik-admin-auth

    env_file:
      - .env
    command:
      - "--broker=amqp://${RABBITMQ_DEFAULT_USER:-guest}:${RABBITMQ_DEFAULT_PASS:-guest}@rabbitmq:5672//"
    depends_on:
      - rabbitmq
      - redis

    networks:
      - traefik-public

  local_worker:
    build:
      context: ..
      dockerfile: ./setup/devops/docker/app.dockerfile
    image: swtools:app
    restart: always
    volumes:
      - ${SWTOOLSWORKINGDIR:-/tmp}:${SWTOOLSWORKINGDIR:-/tmp}
    command: ["celery", "--app=app.worker.celery_app:celery_app", "worker", "-n", "local_worker@%h"]
    env_file:
      - .env
    environment:
      - RABBITMQ_DEFAULT_USER=${RABBITMQ_DEFAULT_USER}
      - RABBITMQ_DEFAULT_PASS=${RABBITMQ_DEFAULT_PASS}
      - RABBITMQ_HOST=rabbitmq
      - REDIS_HOST=${HOSTNAME?Variable not set}
    depends_on:
      - rabbitmq
      - redis
    networks:
      - traefik-public

  dashboard_app:
    image: swtools:app
    restart: always
    labels:
      - traefik.enable=true
      - traefik.http.services.dash-app.loadbalancer.server.port=${DASH_PORT-8080}
      - traefik.http.routers.dash-app-http.entrypoints=http
      - traefik.http.routers.dash-app-http.rule=Host(`dashboard.${HOSTNAME?Variable not set}.example.app`)
      - traefik.docker.network=traefik-public
      - traefik.http.routers.dash-app-https.entrypoints=https
      - traefik.http.routers.dash-app-https.rule=Host(`dashboard.${HOSTNAME?Variable not set}.example.app`)
      - traefik.http.routers.dash-app-https.tls=true
      - traefik.http.routers.dash-app-https.tls.certresolver=le
      - traefik.http.routers.dash-app-http.middlewares=https-redirect

      - traefik.http.middlewares.operator-auth.basicauth.users=${OPERATOR_USERNAME?Variable not set}:${HASHED_OPERATOR_PASSWORD?Variable not set}
      - traefik.http.routers.dash-app-https.middlewares=operator-auth

    volumes:
      - ${SWTOOLSWORKINGDIR:-/tmp}:${SWTOOLSWORKINGDIR:-/tmp}

    command: ['waitress-serve', '--port=${DASH_PORT:-8080}', 'app.order_dashboard:app.server']
    env_file:
      - .env
    environment:
      - RABBITMQ_DEFAULT_USER=${RABBITMQ_DEFAULT_USER}
      - RABBITMQ_DEFAULT_PASS=${RABBITMQ_DEFAULT_PASS}
      - RABBITMQ_HOST=rabbitmq
      - REDIS_HOST=${HOSTNAME?Variable not set}
    networks:
      - traefik-public
    depends_on:
      - rabbitmq
      - redis
networks:
  traefik-public:
    external: true

And here is my Celery config file:

import os

# result backend uses redis
result_backend_host = os.getenv('REDIS_HOST', 'localhost')
result_backend_pass = os.getenv('REDIS_PASS', 'password')

result_backend = 'redis://:{password}@{host}:6379/0'.format(password=result_backend_pass, host=result_backend_host)


# redis_retry_on_timeout = True
redis_socket_keepalive = True

# broker uses rabbitmq
rabbitmq_user = os.getenv('RABBITMQ_DEFAULT_USER', 'guest')
rabbitmq_pass = os.getenv('RABBITMQ_DEFAULT_PASS', 'guest')
rabbitmq_host = os.getenv('RABBITMQ_HOST', 'localhost')

broker_url = 'amqp://{user}:{password}@{host}:5672//'.format(user=rabbitmq_user, password=rabbitmq_pass, host=rabbitmq_host)

include = ['app.worker.tasks', 'app.dashboard.example1', 'app.dashboard.example2']

# task events
worker_send_task_events = True
task_send_sent_event = True

All the env variables are defined, and everything works well except for my socket timeout problem! When I deploy a new worker on a container instance, I set the env variables so that it connects to the RabbitMQ and Redis running in the docker-compose.

Here is my file that defines the Celery app:

from celery import Celery
from app.worker import celery_config

celery_app = Celery()
celery_app.config_from_object(celery_config)
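
For completeness, here is a minimal sketch of how tasks are sent and results fetched (the task is hypothetical, made up to mirror the directory-removal example above; my real tasks live in app.worker.tasks). The .get() call reads from the Redis backend, which is where the redis.exceptions.TimeoutError surfaces:

import shutil

from app.worker.celery_app import celery_app

@celery_app.task
def remove_directory(path):
    # Hypothetical task mirroring the "removing a directory" example above
    shutil.rmtree(path)

# Caller side: .delay() publishes to RabbitMQ; .get() blocks reading the
# result from the Redis backend until it arrives or the socket times out.
result = remove_directory.delay('/tmp/some_dir')
result.get(timeout=300)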

2 Answers


  1. Chosen as BEST ANSWER

    Finally, changing the backend to rpc resolved the problem. I tried different things with Redis that didn’t work. One way to dig further would be to inspect the sockets with tcpdump to see where it blocks, but I didn’t try that since my problem was solved by the rpc backend.
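
    For anyone landing here later, the change is a one-liner in the config module above (the rpc backend sends results back as AMQP messages over the RabbitMQ broker that is already in place, so no Redis socket is involved):

    # Replace the Redis result backend with the RPC backend
    result_backend = 'rpc://'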


  2. I guess that you have some firewall between your Redis instance and your worker.
    Can you log in to that SandboxHost... and make sure that you can connect to your Redis?

    You can do that with telnet, for example:

    telnet <your_redis_hostname> <your_redis_port>
    

    or with redis-cli:

    redis-cli -h <your_redis_hostname> -p <your_redis_port>
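
    If the container image has neither tool, a short redis-py check works too (a sketch; fill in the host and password your worker is configured with):

    import redis

    # Quick reachability check; a short socket_timeout makes a firewall or
    # DNS problem fail fast instead of hanging.
    r = redis.Redis(host='<your_redis_hostname>', port=6379,
                    password='<your_redis_password>', socket_timeout=5)
    print(r.ping())  # True if Redis is reachable and the password is correct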
    

    EDIT:

    Seems like you’re missing result_backend:

    result_backend = f"redis://username:{result_backend_pass}@{result_backend_host}:6379/0"
    

    and make sure that your REDIS_HOST=${HOSTNAME?Variable not set} is valid…

    EDIT2:

    Can you add a bind address to your Redis command:

    ["redis-server", "--bind", "0.0.0.0", "--requirepass", "${RABBITMQ_DEFAULT_PASS:-password}"]
    

    Please be aware of its security implications!
