Summary
Kanban workers running long hardware/firmware workflows can be reclaimed with stale_lock while still legitimately busy, because neither kanban_heartbeat nor heartbeat_claim can run during a single blocking MCP tool invocation (e.g. a large amlogic_firmware_download). The default claim TTL is 15 minutes (DEFAULT_CLAIM_TTL_SECONDS), so any single tool call that runs longer than that loses the claim even if the worker process is healthy.
Expected behavior
- Running tasks should not be reclaimed while the worker is actively executing work, or
- There should be a supported, config-driven way to extend the initial claim TTL / heartbeat extension window for profiles that routinely block on long MCP calls.
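One way to read the first expectation is a heartbeat that does not depend on tool rounds. The following is a minimal sketch, assuming the worker runtime (not the agent loop) can call some heartbeat function on a timer. `keep_claim_alive` and `send_heartbeat` are hypothetical names for illustration, not hermes-agent APIs.

```python
import threading
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def keep_claim_alive(
    send_heartbeat: Callable[[], None], interval_seconds: float
) -> Iterator[None]:
    """Call send_heartbeat every interval_seconds until the with-block exits.

    send_heartbeat is a hypothetical callable that would extend the claim.
    """
    stop = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_seconds):
            send_heartbeat()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    try:
        yield
    finally:
        stop.set()
        thread.join()


# Demo with a short interval; time.sleep stands in for one blocking tool call.
beats = []
with keep_claim_alive(lambda: beats.append(time.monotonic()), interval_seconds=0.05):
    time.sleep(0.3)
print(len(beats) >= 2)
```

A design caveat with this approach: a tool call that hangs forever would also keep its claim forever, so a real implementation would probably want an upper bound on total extension.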
Actual behavior
- Roughly 15 minutes after claimed, release_stale_claims reclaims the task (reclaimed / stale_lock) and the dispatcher spawns a second worker for the same task_id.
- The first session may already have started the burn/download pipeline; the second session duplicates the effort and confuses operators.
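The timing behind this can be sketched with a toy model. It is not the hermes-agent implementation: `Claim`, `claim_task`, `heartbeat`, and `is_stale` are hypothetical stand-ins, and only the 15-minute default comes from this report.

```python
from dataclasses import dataclass

DEFAULT_CLAIM_TTL_SECONDS = 15 * 60  # 15-minute default stated in this report


@dataclass
class Claim:
    task_id: str
    claim_expires: float  # absolute time in seconds


def claim_task(task_id: str, now: float, ttl: float = DEFAULT_CLAIM_TTL_SECONDS) -> Claim:
    """Hypothetical: take the claim and set its first expiry."""
    return Claim(task_id, claim_expires=now + ttl)


def heartbeat(claim: Claim, now: float, ttl: float = DEFAULT_CLAIM_TTL_SECONDS) -> None:
    """Hypothetical: push the expiry forward from the current time."""
    claim.claim_expires = now + ttl


def is_stale(claim: Claim, now: float) -> bool:
    """Hypothetical sweep check: the claim is stale once its expiry has passed."""
    return now >= claim.claim_expires


# Worker A claims at t=0 and enters one 25-minute blocking tool call.
# No heartbeat can be sent, so a sweep at minute 16 sees a stale claim.
long_call = claim_task("task-1", now=0.0)
reclaimed_mid_call = is_stale(long_call, now=16 * 60)

# Worker B makes short tool calls and heartbeats at minute 10 between them.
short_calls = claim_task("task-2", now=0.0)
heartbeat(short_calls, now=10 * 60)
still_held = not is_stale(short_calls, now=16 * 60)

print(reclaimed_mid_call, still_held)  # True True
```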
Environment
- hermes-agent revision: faa13e49f (v2026.5.7-26-gfaa13e49f)
- Profile: case_runner (dispatcher-spawned worker, HERMES_KANBAN_TASK set)
- Gateway: embedded kanban dispatcher (default)
Evidence (sanitized)
- Same task_id, two runs: first run ends with outcome=reclaimed, error=stale_lock=...; current_run_id increments.
- Session timeline: kanban_show → DUT acquire → amlogic_firmware_download / extract / amlogic_burn_firmware_start. No tool round is available to call kanban_heartbeat inside a long-running single MCP response.
Related upstream issues / PRs
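- kanban_heartbeat must extend claim_expires (already addressed for between-tool heartbeats).
- stale_lock / respawn loops (may overlap symptoms).
This report is adjacent to #21147: even with heartbeat_claim wired correctly, a heartbeat cannot be issued while the worker is blocked inside one tool call that runs longer than the TTL.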
Possible directions (for maintainers)
- Config: kanban.claim_ttl_seconds (or env HERMES_KANBAN_CLAIM_TTL_SECONDS) honored by dispatch_once / heartbeat_claim default extension.
- Task-level override on the task row (e.g. optional claim_ttl_seconds for long-running board tasks).
- Document explicitly that any single blocking operation must finish within DEFAULT_CLAIM_TTL_SECONDS, or split work (e.g. pre-download artifacts outside the Kanban worker).
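To make the first two directions concrete, here is a minimal sketch of how a TTL could be resolved. The function `resolve_claim_ttl` is hypothetical, and so is the precedence order (task row, then env, then config, then default). The setting names are the ones proposed above, not options that exist today.

```python
import os
from typing import Any, Mapping, Optional

DEFAULT_CLAIM_TTL_SECONDS = 15 * 60  # current default per this report


def resolve_claim_ttl(
    task_ttl: Optional[int] = None,
    config: Optional[Mapping[str, Any]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Pick the first valid positive TTL; fall back to the default.

    Assumed precedence: task row > environment > config file > default.
    """
    environ = os.environ if environ is None else environ
    kanban = (config or {}).get("kanban") or {}
    candidates = (
        task_ttl,
        environ.get("HERMES_KANBAN_CLAIM_TTL_SECONDS"),
        kanban.get("claim_ttl_seconds"),
    )
    for raw in candidates:
        if raw is None or raw == "":
            continue
        try:
            value = int(raw)
        except (TypeError, ValueError):
            continue  # ignore malformed values rather than failing dispatch
        if value > 0:
            return value
    return DEFAULT_CLAIM_TTL_SECONDS


cfg = {"kanban": {"claim_ttl_seconds": 3600}}
print(resolve_claim_ttl(config=cfg, environ={}))  # 3600
```

The same resolved value would need to be used both for the initial claim and for each heartbeat extension; otherwise the first heartbeat could shorten a long initial claim back to the default.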
Thanks for considering.