My team and I are on Airflow v2.1.0, using the Celery executor with Redis. Recently we’ve noticed that some jobs occasionally keep running until we kill them manually (many hours, sometimes days, basically until someone notices). We haven’t spotted any particular pattern yet.

We also use DataDog with the statsd provider to collect and monitor the metrics produced by Airflow. Ideally we would set up a DataDog monitor for this, but there doesn’t appear to be an obvious metric for this situation.

How can we detect and alarm on stuck jobs like this?

2 Answers


  1. You can use Airflow’s SLAs in combination with the sla_miss_callback parameter to call some external service (we use Slack, for example).

    From the docs:

    An SLA, or a Service Level Agreement, is an expectation for the maximum time a Task should take. If a task takes longer than this to run, then it is visible in the "SLA Misses" part of the user interface, as well as going out in an email of all tasks that missed their SLA.

    With that, you define an SLA for the tasks you want to monitor, and provide an sla_miss_callback to get notified about misses.
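
    A minimal sketch of how this could look, assuming a Slack incoming webhook (the webhook URL, the example DAG, and the notify_slack_sla_miss helper below are placeholders, not anything Airflow ships):

    from datetime import datetime, timedelta

    import requests
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Placeholder: point this at your own Slack incoming webhook.
    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."


    def notify_slack_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
        # Called by the scheduler whenever tasks in this DAG miss their SLA.
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"SLA missed in DAG {dag.dag_id}:\n{task_list}"},
        )


    with DAG(
        dag_id="example_sla_dag",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
        sla_miss_callback=notify_slack_sla_miss,      # DAG-level callback
        default_args={"sla": timedelta(minutes=30)},  # max expected runtime per task
    ) as dag:
        BashOperator(task_id="long_running_task", bash_command="sleep 10")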

  2. This issue is probably fixed by PR16550.

    The problem arises when you restart the scheduler: tasks that were scheduled or queued (but hadn’t made it to the actual executor yet) end up in a state the scheduler is never able to start. They stay that way indefinitely without manual intervention (even restarting the scheduler again won’t fix it). However, as you point out, you can still run them manually.
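
    Until you’re on a version with that fix, one rough way to detect it (just a sketch, not part of the PR; the two-hour threshold, metric name, and statsd prefix are arbitrary placeholders) is to query the metadata database for task instances that have been sitting in the queued state for too long, and push the count to statsd so your DataDog monitor can alert on it:

    from datetime import timedelta

    from airflow.models import TaskInstance
    from airflow.utils import timezone
    from airflow.utils.session import provide_session
    from airflow.utils.state import State
    from statsd import StatsClient

    # Placeholders: pick your own metric prefix and point at your statsd host.
    statsd = StatsClient(host="localhost", port=8125, prefix="airflow_custom")


    @provide_session
    def count_stuck_queued_tasks(max_queued_age=timedelta(hours=2), session=None):
        # Find task instances that have been sitting in QUEUED longer than max_queued_age.
        cutoff = timezone.utcnow() - max_queued_age
        stuck = (
            session.query(TaskInstance)
            .filter(TaskInstance.state == State.QUEUED)
            .filter(TaskInstance.queued_dttm < cutoff)
            .all()
        )
        # Gauge goes to statsd -> DataDog, so you can alert whenever it is non-zero.
        statsd.gauge("stuck_queued_tasks", len(stuck))
        return stuck


    if __name__ == "__main__":
        for ti in count_stuck_queued_tasks():
            print(ti.dag_id, ti.task_id, ti.execution_date, ti.queued_dttm)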
