After experimenting with some failure scenarios in Celery (with Redis as the broker, for whatever it's worth), we came to the conclusion that there is effectively no point in setting acks_late=true without also setting task_reject_on_worker_lost=true, because the task won’t be rescheduled (again, in our tests): it stays in the "unacked" category forever.
At the same time, everybody says that acks_late makes the task eligible for rescheduling on the same or another worker, so the question is: when does that actually happen?
The official docs say:
Note that the worker will acknowledge the message if the child process
executing the task is terminated (either by the task calling
sys.exit(), or by signal) even when acks_late is enabled. This
behavior is intentional as…
We don’t want to rerun tasks that forces the kernel to send a SIGSEGV (segmentation fault) or similar signals to the process.
We assume that a system administrator deliberately killing the task does not want it to automatically restart.
A task that allocates too much memory is in danger of triggering the kernel OOM killer, the same may happen again.
A task that always fails when redelivered may cause a high-frequency message loop taking down the system.
If you really want a task to be redelivered in these scenarios you
should consider enabling the task_reject_on_worker_lost setting.
What are possible examples of "something went wrong" that don’t fall into the "worker terminated deliberately or due to a signal caught" category?
2 Answers
Reboot, power outage, hardware failure. N.B.: all of your examples assume that the prefetch multiplier is 1.
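For reference, here is a minimal sketch of a Celery configuration module that pins the prefetch multiplier to 1 alongside late acknowledgement. The setting names are Celery's documented lowercase settings; the module name celeryconfig.py and the broker URL are placeholders chosen for illustration, not something taken from the question.

```python
# celeryconfig.py (hypothetical module name); it would be loaded with
# app.config_from_object("celeryconfig") on an existing Celery app.

# Placeholder broker URL, assuming a local Redis instance.
broker_url = "redis://localhost:6379/0"

# Each worker process reserves only one message at a time, so a crash
# cannot strand extra prefetched tasks on a dead worker.
worker_prefetch_multiplier = 1

# Acknowledge a message after the task has run instead of just before.
task_acks_late = True
```

With a larger multiplier, a worker that dies may be holding several reserved but unstarted messages, which makes failure tests harder to interpret.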
Note that there is a difference between the Celery worker process and the child processes that actually execute the tasks.
By default, when you start a Celery worker, it creates one "parent" process and x child processes that execute the tasks, where x is the number of CPUs you have (the docs explain this in more detail, including how to configure it).
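As a rough illustration of that default, the number of child processes can also be set explicitly through the worker_concurrency setting. The snippet below is only a sketch that mirrors the "one child per CPU" behaviour described above; it is a plain config fragment and does not start a worker.

```python
import os

# Explicitly request one child process per CPU, which mirrors the default.
# os.cpu_count() can return None on unusual platforms, hence the fallback.
worker_concurrency = os.cpu_count() or 1
```

The same value can be passed on the command line with the worker's --concurrency option.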
I have tested all the different scenarios, and these are my conclusions:
acks_late is about what happens when the worker itself dies. task_reject_on_worker_lost is about the child process actually executing the task.
For example, say I have a k8s pod running a Celery worker: if I send SIGKILL (a cold shutdown) to the pod, having acks_late set to true ensures that the task will be picked up by a different worker.
But if I somehow kill the child process executing the task (for example, by going inside the pod and killing the child process, or if the process exits by itself), the task will not be picked up again, even if acks_late is true.
If you also set task_reject_on_worker_lost to true, the task will be picked up again.
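The conclusions above can be summarised as a small decision function. This is only a toy model of the behaviour I observed, not Celery's actual code; the names Failure and is_redelivered are made up for illustration.

```python
from enum import Enum


class Failure(Enum):
    """The two failure kinds distinguished above (hypothetical names)."""
    WORKER_LOST = "whole worker dies (SIGKILL to the pod, reboot, power loss)"
    CHILD_LOST = "only the child process running the task dies"


def is_redelivered(acks_late: bool, reject_on_worker_lost: bool,
                   failure: Failure) -> bool:
    """Model of whether the task ends up being run again."""
    if not acks_late:
        # The message was acknowledged before the task started,
        # so the broker no longer tracks it.
        return False
    if failure is Failure.WORKER_LOST:
        # Nobody is left to acknowledge the message, so the broker
        # eventually hands it to another worker. With Redis this waits
        # for the transport's visibility timeout to expire.
        return True
    # The parent survived: it acknowledges the message on behalf of the
    # dead child unless task_reject_on_worker_lost asks it to requeue.
    return reject_on_worker_lost


for failure in Failure:
    for reject in (False, True):
        print(failure.name, "reject_on_worker_lost =", reject,
              "->", is_redelivered(True, reject, failure))
```

The only combination where the second setting changes the outcome is a lost child process with acks_late enabled, which matches what the docs quoted in the question describe.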
Hope that clarifies everything.