Back to KB
Difficulty
Intermediate
Read Time
8 min

Kanban in Hermes Agent for Self Hosted LLM Workflows

By Codcompass TeamΒ·Β·8 min read

Deterministic Concurrency Control for Self-Hosted LLM Task Queues

Current Situation Analysis

Autonomous agent frameworks are typically architected around the assumption of infinitely scalable inference endpoints. When paired with self-hosted LLM runtimes like Ollama, vLLM, or llama.cpp, this architectural mismatch creates a critical failure mode: unbounded task dispatching. The Hermes Agent Kanban system provides a durable, SQLite-backed state machine (~/.hermes/kanban.db) to track task lifecycles across lanes. Its dispatcher component scans for ready cards, claims them atomically, and spawns isolated worker profiles. However, the default configuration lacks a global concurrency governor.

The configuration surface only exposes dispatch_in_gateway and dispatch_interval_seconds. There is no native max_active_tasks parameter wired into the dispatch path. When hermes kanban dispatch executes, it pulls every eligible card into execution during that tick. For cloud APIs, this is acceptable because rate limiting and auto-scaling occur upstream. For local GPU clusters, it triggers immediate resource exhaustion. The inference server's request queue fills, VRAM fragments, context switching thrashes, and latency spikes into timeout territory. Teams often mistake this for a model performance issue when it is actually a scheduling architecture problem.

This gap is frequently overlooked because agent frameworks prioritize task throughput over hardware preservation. The dispatcher treats the LLM gateway as a stateless function endpoint rather than a constrained compute resource. Without explicit pacing, background pipelines, interactive queries, and maintenance jobs compete for the same VRAM pool, causing cascading failures that are difficult to diagnose post-mortem.

WOW Moment: Key Findings

The core insight emerges when comparing how different dispatch strategies interact with fixed hardware constraints. The following table contrasts four common approaches against three critical operational metrics.

Dispatch StrategyGPU Utilization StabilityRequest Latency VarianceTimeout Rate Under Load
Unbounded Gateway DispatchSpikes to 100%, then thrashesHigh (200ms β†’ 15s+)Critical (>40%)
CLI --max Cap (Per-Tick)Moderate, but drifts over timeMedium (500ms β†’ 8s)Elevated (~25%)
Slot-Aware Cron ControllerStable (70-85% target range)Low (predictable 200-600ms)Minimal (<2%)
Dependency-Driven SequencingPredictable, phase-gatedVery Low (serialized)Near Zero

The data reveals a fundamental truth: limiting new spawns per tick does not equal limiting concurrent execution. The --max flag only restricts how many tasks the dispatcher claims during a single scan. It does not account for tasks already running. A slot-aware controller that calculates available_capacity = target_concurrency - active_tasks before dispatching maintains hardware within safe operating boundaries. This transforms the system from reactive crash recovery to proactive resource governance, enabling stable long-running pipelines without manual intervention.

Core Solution

Implementing deterministic concurrency control requires decoupling the dispatch trigger from the gateway process, introducing state-aware capacity calculation, and modeling task relationships explicitly. The following architecture ensures the inference server never receives more concurrent requests than the hardware can sustain.

Step 1: Isolate the Dispatch Path

Gateway-embedded dispatch and external daemon dispatch cannot safely share the same SQLite board. Concurrent claim attempts create race conditions that corrupt task state. Disable the embedded dispatcher and route all scheduling through

πŸŽ‰ Mid-Year Sale β€” Unlock Full Article

Base plan from just $4.99/mo or $49/yr

Sign in to read the full article and unlock all 635+ tutorials.

Sign In / Register β€” Start Free Trial

7-day free trial Β· Cancel anytime Β· 30-day money-back