My team and I are on Airflow v2.1.0 using the Celery executor with Redis. Recently we’ve noticed that some jobs occasionally keep running until we kill them (many hours, sometimes days, basically until someone notices). We haven’t spotted a particular pattern yet.
We also use DataDog and the statsd provider to collect and monitor metrics produced by Airflow. Ideally we could set up a DataDog monitor for this, but there doesn’t appear to be an obvious metric for this situation.
How can we detect and alarm on stuck jobs like this?
2 Answers
You can use Airflow’s SLAs in combination with the sla_miss_callback parameter to call some external service (we use Slack, for example). From the docs: you define an SLA for those tasks you want to monitor and provide an sla_miss_callback to get notified about those misses.
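For reference, a minimal sketch of what that looks like (the DAG id, webhook environment variable, and one-hour SLA below are placeholders, not anything prescribed by Airflow):

```python
# Minimal sketch: a DAG-level sla_miss_callback that posts to a Slack incoming webhook.
# SLACK_WEBHOOK_URL is a placeholder environment variable, not an Airflow setting.
import os
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Invoked by the scheduler whenever tasks in this DAG miss their SLA.
    requests.post(
        os.environ["SLACK_WEBHOOK_URL"],
        json={"text": f"SLA missed in DAG {dag.dag_id}:\n{task_list}"},
    )


with DAG(
    dag_id="example_with_sla",                 # hypothetical DAG
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    sla_miss_callback=notify_sla_miss,
    default_args={"sla": timedelta(hours=1)},  # flag any task still unfinished 1h past the scheduled period
) as dag:
    BashOperator(task_id="long_running_task", bash_command="sleep 30")
```

Keep in mind that SLAs are measured against the DAG run's scheduled time rather than the task's actual start time, so a task that hangs will eventually cross the threshold no matter when it started (and a run that starts very late can also trigger a miss).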
This issue is probably fixed by PR16550. The problem arises when you restart the scheduler: any tasks that were scheduled or queued (but hadn’t made it to the actual executor yet) end up in a state the scheduler will never be able to start. They remain that way indefinitely (even restarting the scheduler again won’t fix it) without manual intervention. However, as you point out, you can indeed still run them manually.
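To actually detect and alarm on these stuck task instances from DataDog, one option is a small monitoring DAG that queries the metadata database for task instances that have been queued or running longer than some threshold and emits the count through Airflow's existing statsd integration. This is a sketch rather than a built-in feature; the DAG id, metric name, and six-hour threshold below are assumptions to adapt:

```python
# Sketch of a monitoring DAG: counts task instances stuck in "queued"/"running"
# longer than a threshold and emits the count as a statsd gauge so a DataDog
# monitor can alert on it. DAG id, metric name, and threshold are placeholders.
from datetime import datetime, timedelta

from sqlalchemy import and_, or_

from airflow import DAG
from airflow.models import TaskInstance
from airflow.operators.python import PythonOperator
from airflow.stats import Stats
from airflow.utils import timezone
from airflow.utils.session import provide_session
from airflow.utils.state import State

STUCK_AFTER = timedelta(hours=6)  # tune to slightly above your longest legitimate task


@provide_session
def report_stuck_tasks(session=None, **_):
    cutoff = timezone.utcnow() - STUCK_AFTER
    stuck = (
        session.query(TaskInstance)
        .filter(
            or_(
                # running far longer than expected
                and_(TaskInstance.state == State.RUNNING, TaskInstance.start_date < cutoff),
                # queued but never picked up (e.g. lost after a scheduler restart)
                and_(TaskInstance.state == State.QUEUED, TaskInstance.queued_dttm < cutoff),
            )
        )
        .all()
    )
    # Goes out via the same statsd pipeline as Airflow's built-in metrics.
    Stats.gauge("stuck_task_instances", len(stuck))
    for ti in stuck:
        print(f"Stuck: {ti.dag_id}.{ti.task_id} ({ti.state}) since {ti.start_date or ti.queued_dttm}")


with DAG(
    dag_id="stuck_task_monitor",
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/30 * * * *",  # check every 30 minutes
    catchup=False,
) as dag:
    PythonOperator(task_id="report_stuck_tasks", python_callable=report_stuck_tasks)
```

With the default statsd settings the gauge shows up in DataDog under the configured prefix (airflow.stuck_task_instances by default), and if you have a statsd allow-list configured you will need to include the custom metric name in it.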