Kanban: stale_lock reclaim when a single MCP tool call exceeds DEFAULT_CLAIM_TTL (15m); heartbeat impossible until tool returns #25517

@yongzhigao-design

Summary

Kanban workers running long hardware/firmware workflows can be reclaimed with stale_lock while still legitimately busy, because no kanban_heartbeat / heartbeat_claim can run during a single blocking MCP tool invocation (e.g. a large amlogic_firmware_download). The default claim TTL is 15 minutes (DEFAULT_CLAIM_TTL_SECONDS), so any single tool call that runs longer than that loses the claim even if the process is healthy.
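To make the failure mode concrete, here is a minimal sketch of the timing, not the actual hermes-agent code. The function name `first_stale_time` and the rule that each heartbeat extends the claim by one full TTL are assumptions for illustration; only the 15-minute default comes from this report. It models a worker that can heartbeat only between tool calls.

```python
DEFAULT_CLAIM_TTL_SECONDS = 15 * 60  # 15 minutes, as stated in this report


def first_stale_time(tool_call_seconds, ttl_seconds=DEFAULT_CLAIM_TTL_SECONDS):
    """Return the time (seconds after the claim) at which the claim expires,
    or None if it never does.

    Assumption: the worker heartbeats only between tool calls, and each
    heartbeat pushes the expiry to "now + ttl_seconds".
    """
    now = 0.0
    expires_at = now + ttl_seconds
    for duration in tool_call_seconds:
        call_end = now + duration
        if call_end > expires_at:
            # The claim expires while the worker is still blocked in this
            # call, so a stale-claim sweep can reclaim the task.
            return expires_at
        now = call_end
        expires_at = now + ttl_seconds  # heartbeat after the call returns
    return None


# 25 minutes of work in short calls keeps the claim alive.
print(first_stale_time([300, 300, 300, 300, 300]))  # None
# One 20-minute call after a 1-minute call loses the claim at t=960s.
print(first_stale_time([60, 1200]))  # 960.0
```

Under these assumptions the total run time does not matter; only the longest single blocking call does.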

Expected behavior

  • Running tasks should not be reclaimed while the worker is actively executing work, or
  • There should be a supported, config-driven way to extend the initial claim TTL / heartbeat extension window for profiles that routinely block on long MCP calls.

Actual behavior

  • About 15 minutes after the task is claimed, release_stale_claims reclaims it (reclaimed / stale_lock) and the dispatcher spawns a second worker for the same task_id.
  • The first session may already have started the burn/download pipeline; the second session then duplicates that work and confuses operators.

Environment

  • hermes-agent revision: faa13e49f (v2026.5.7-26-gfaa13e49f)
  • Profile: case_runner (dispatcher-spawned worker, HERMES_KANBAN_TASK set)
  • Gateway: embedded kanban dispatcher (default)

Evidence (sanitized)

  • Same task_id, two runs: first run ends with outcome=reclaimed, error=stale_lock=...; current_run_id increments.
  • Session timeline: kanban_show → DUT acquire → amlogic_firmware_download / extract / amlogic_burn_firmware_start; no tool round is available to call kanban_heartbeat inside a long-running single MCP response.

Related upstream issues / PRs

This report is adjacent to #21147: even with heartbeat_claim wired correctly, heartbeats cannot be issued while blocked inside one tool call longer than the TTL.

Possible directions (for maintainers)

  1. Config: kanban.claim_ttl_seconds (or env HERMES_KANBAN_CLAIM_TTL_SECONDS), honored by dispatch_once and used as the default extension in heartbeat_claim.
  2. Task-level override on the task row (e.g. optional claim_ttl_seconds for long-running board tasks).
  3. Document explicitly that any single blocking operation must finish within DEFAULT_CLAIM_TTL_SECONDS, or split work (e.g. pre-download artifacts outside the Kanban worker).
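Directions 1 and 2 could be combined into a single lookup. The sketch below is only an illustration: `resolve_claim_ttl`, the shape of `task_row` and `config`, and the precedence order (task row, then environment, then config file, then default) are all assumptions, not existing hermes-agent behavior. The key and variable names are the ones proposed above.

```python
import os
from typing import Mapping, Optional

DEFAULT_CLAIM_TTL_SECONDS = 15 * 60  # current default, per this report


def _positive_int(value) -> Optional[int]:
    """Parse value as a positive integer, or return None if it is unusable."""
    try:
        n = int(value)
    except (TypeError, ValueError):
        return None
    return n if n > 0 else None


def resolve_claim_ttl(task_row: Mapping, config: Mapping,
                      env: Mapping = os.environ) -> int:
    """Pick the claim TTL for one task (hypothetical helper).

    Assumed precedence: task-level override, then environment variable,
    then kanban.claim_ttl_seconds in config, then the built-in default.
    Invalid or non-positive values are skipped rather than raising.
    """
    candidates = (
        task_row.get("claim_ttl_seconds"),
        env.get("HERMES_KANBAN_CLAIM_TTL_SECONDS"),
        (config.get("kanban") or {}).get("claim_ttl_seconds"),
    )
    for candidate in candidates:
        ttl = _positive_int(candidate)
        if ttl is not None:
            return ttl
    return DEFAULT_CLAIM_TTL_SECONDS


# A firmware-burn task asks for a one-hour claim; other tasks keep the default.
print(resolve_claim_ttl({"claim_ttl_seconds": 3600}, {}, env={}))  # 3600
print(resolve_claim_ttl({}, {}, env={}))                           # 900
```

Skipping invalid values instead of failing means a typo in the environment variable falls back to the next source, which seems safer for a dispatcher than refusing to claim tasks at all.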

Thanks for considering.

Metadata

Assignees

No one assigned

    Labels

    P3 (Low: cosmetic, nice to have), comp/tools (Tool registry, model_tools, toolsets), tool/mcp (MCP client and OAuth), type/bug (Something isn't working)