Orchestration

Pause & recovery

Pausing work, surviving rate limits, and reclaiming orphaned runs.

Orchestration is interruptible and self-healing: a plan can be paused at any time, providers that hit rate limits back off instead of failing, and runs orphaned by a dead session are reclaimed automatically.

Pause & resume

  • Pausing a plan (status = paused) halts all new dispatch, auto-review, and merging until the plan is resumed.
  • A task paused mid-flight moves to waiting_for_resume; Resume returns it to draft for clean re-dispatch.
  • A provider that hits a rate limit pauses its run via POST /api/v1/cli-runs/{uuid}/pause (with a resume time), instead of burning the task as failed.
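The pause rules above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the host name, the resume_at field name, and the helper names (dispatch_allowed, resume_task, build_pause_request) are assumptions made for the example. Only the endpoint path and the status names paused, waiting_for_resume and draft come from the text.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.request import Request

BASE_URL = "https://orchestrator.example"  # hypothetical host, for illustration only


def dispatch_allowed(plan_status: str) -> bool:
    """A paused plan halts new dispatch, auto-review and merge."""
    return plan_status != "paused"


def resume_task(task_status: str) -> str:
    """Resume sends a waiting task back to draft; other statuses are left unchanged."""
    return "draft" if task_status == "waiting_for_resume" else task_status


def build_pause_request(run_uuid: str, retry_after_s: int,
                        now: Optional[datetime] = None) -> Request:
    """Build (but do not send) the pause call for a rate-limited run."""
    now = now or datetime.now(timezone.utc)
    resume_at = now + timedelta(seconds=retry_after_s)
    # "resume_at" is an assumed field name; the text only says a resume time is sent.
    body = json.dumps({"resume_at": resume_at.isoformat()}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/v1/cli-runs/{run_uuid}/pause",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A provider wrapper might call build_pause_request(run_uuid, retry_after_s) when it sees a rate-limit response and pass the result to urllib.request.urlopen, rather than reporting the task as failed.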

Recovery

  • Dead-session reclaim: runs left behind by a vanished session are re-emitted to the spawn queue.
  • On lead connect/resume, orphaned queued runs are recovered and re-dispatched.
  • A spawn-loop guard and an in-flight cap stop a lead from fanning out without bound.
  • Idle workers (no chunk for ~45s) are pruned from the lead digest; the SessionEnd hook reaps spawned worker processes by pid.
  • The reject → revision loop is capped at 3 cycles (REVISION_CAP); past that an operator must intervene.
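Three of the guards above reduce to simple checks, sketched below under stated assumptions. REVISION_CAP = 3 and the roughly 45-second idle window come from the text. The function names, the "revise" and "needs_operator" labels, and the in-flight cap being passed in as a parameter are hypothetical, since the text does not give the cap's value or the real identifiers.

```python
from typing import Dict, List

REVISION_CAP = 3        # from the text: reject -> revision loop is capped at 3 cycles
IDLE_TIMEOUT_S = 45.0   # approximate; the text says "~45s"


def next_step_after_reject(completed_cycles: int) -> str:
    """Allow another revision while under the cap, else hand off to an operator."""
    # Return labels are illustrative, not real status names.
    return "revise" if completed_cycles < REVISION_CAP else "needs_operator"


def active_workers(last_chunk_at: Dict[str, float], now: float) -> List[str]:
    """Keep only workers that produced a chunk within the idle window."""
    return [worker for worker, seen in last_chunk_at.items()
            if now - seen < IDLE_TIMEOUT_S]


def may_spawn(in_flight: int, cap: int) -> bool:
    """In-flight cap: refuse new spawns once the lead has `cap` runs outstanding."""
    return in_flight < cap
```

For example, a worker whose last chunk arrived 60 seconds ago would be dropped by active_workers, while one that reported 10 seconds ago stays in the digest.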