How It Works

Mental model for ark-operator — AI agents as Kubernetes resources, the operator reconcile loop, and how agent pods process tasks.



The core idea

Kubernetes knows how to run containers. It has no concept of what runs inside them.

ark-operator adds a layer above that. Instead of managing pods, it manages AI agents — and it understands what agents actually need: which model to call, what instructions to give it, which tools to connect, how many tokens it’s allowed to spend, and whether its output is actually good. These are not things you can express with a Deployment and a readiness probe.

When you define an ArkAgent, you declare:

  • What it knows — a system prompt that can live in a ConfigMap and be rolled out like any other config change
  • What it can do — MCP tool servers and inline webhook tools connected at startup
  • How much it can cost — per-call token limits and rolling 24h budgets enforced before any API call is made
  • Whether it’s healthy — a semantic readiness probe that actually calls the LLM and validates the response, not just whether a port is open

The operator’s job is to make reality match that declaration and keep it there — the same reconciliation loop that manages any other Kubernetes resource.


The operator reconcile loop

The operator watches your ArkAgent resources. When something changes — or on a periodic resync — it runs a reconcile:

Watch ArkAgent CR
Reconcile()
    ├── How many agent pods are currently running?
    ├── Compare with spec.replicas
    ├── Scale up → create new agent pods
    ├── Scale down → delete excess pods
    ├── Inject env vars: MODEL, SYSTEM_PROMPT, MCP_SERVERS
    ├── Run semantic liveness checks
    └── Update .status.readyReplicas

The operator never modifies your YAML. It only manages the backing Kubernetes resources (Deployments, Pods) that bring the desired state to life.
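In practice that split looks like the sketch below: spec is the half you declare, status is the half the operator reports. The condition names are hypothetical; only spec.replicas and status.readyReplicas come from the text above:

```yaml
# Illustrative spec/status split — the operator writes only .status.
spec:
  replicas: 3            # desired: you declare this
status:
  readyReplicas: 2       # observed: the operator reports this
  conditions:
    - type: SemanticReady        # hypothetical condition name
      status: "False"
      reason: ProbeResponseMismatch
```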


What runs inside an agent pod

Each agent pod runs the ark-runtime binary. Its job is deliberately narrow: poll a task queue, call the LLM, and return results.

agent pod startup
    ├── Read config from env vars (MODEL, SYSTEM_PROMPT, MCP_SERVERS, ...)
    ├── Connect to MCP tool servers
    └── Poll the task queue
        Task arrives
            ├── Build prompt from system prompt + task input
            ├── Call LLM provider (tool-use loop until model stops calling tools)
            └── Return result to queue

The operator injects all configuration as environment variables. The agent runtime has no knowledge of Kubernetes — it just reads env vars and processes tasks. This means you can run the same runtime code locally with ark run.
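In the backing Deployment, that injection might look like the following container sketch. The variable names come from the text above; the image reference and value sources are assumptions:

```yaml
# Sketch of the env the operator injects into each agent pod
# (variable names from the text; image and values illustrative).
containers:
  - name: ark-runtime
    image: example.com/ark-runtime:latest   # hypothetical image ref
    env:
      - name: MODEL
        value: gpt-4o
      - name: SYSTEM_PROMPT
        valueFrom:
          configMapKeyRef:
            name: research-agent-prompt
            key: system-prompt
      - name: MCP_SERVERS
        value: "http://mcp-web-search:8080"
```

Because everything arrives as plain env vars, pointing the same binary at a local queue with the same variables set reproduces the in-cluster behavior.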


How tasks flow

There are two controllers involved in pipeline execution:

  • ArkTeam controller — manages infrastructure: creates the ArkAgent pods for each role, sets up ArkService routing, and creates an ArkRun when a trigger fires.
  • ArkRun controller — owns pipeline execution: submits step tasks to the queue, polls for results, advances the DAG, and writes final output.

External trigger (webhook / cron / ark trigger)
ArkEvent → signals ArkTeam controller
ArkTeam controller → creates ArkRun (immutable execution record)
ArkRun controller → submits step tasks to the task queue
Agent pods poll their queue → call LLM → write result back
ArkRun controller reads results → advances DAG → submits next steps
All steps done → ArkRun phase = Succeeded → output written to ArkRun.status
                 ArkTeam mirrors phase in status.lastRunPhase

The task queue is the boundary between the operator and the agent pods. Trace context is propagated across this boundary, so a single OpenTelemetry trace spans the full path from trigger to final output.

When debugging a pipeline, always start with ArkRun — that is where step outputs, token usage, and failure messages live:

kubectl get arkrun -n my-org -l arkonis.dev/team=my-team
kubectl describe arkrun my-team-run-a1b2c3 -n my-org
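The ArkRun status you get back might be shaped roughly like the sketch below — the exact schema is an assumption, but per-step phase, token usage, and failure messages are the fields the text says live there:

```yaml
# Illustrative ArkRun status — field names are a sketch, not the exact schema.
status:
  phase: Failed
  steps:
    research:
      phase: Succeeded
      tokensUsed: 2413
      output: "..."
    summarize:
      phase: Failed
      message: "per-call token limit exceeded"
```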

ArkTeam: the execution primitive

An ArkTeam is the unit of work. It defines roles (agents) and either:

  • A pipeline — a DAG of steps executed in declared order (like a CI workflow)
  • Dynamic delegation — agents decide at runtime what to delegate to whom (like an org chart)

In pipeline mode, template expressions connect step outputs to the next step’s inputs:

pipeline:
  - role: research
    inputs:
      prompt: "{{ .input.topic }}"
  - role: summarize
    dependsOn: [research]
    inputs:
      content: "{{ .steps.research.output }}"

The ArkRun controller tracks each step’s phase (Pending → Running → Succeeded/Failed) and orchestrates the DAG. You never write scheduling logic.

Each pipeline trigger creates a new ArkRun — think of it like a Kubernetes Job. The ArkTeam is the template; the ArkRun is the execution record. You can list past runs with kubectl get arkrun -n <ns>, inspect their step outputs, and see exact token usage.
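Tying the pieces together, a full ArkTeam might look like the sketch below. The pipeline section matches the example above verbatim; the roles field names (name, agentRef) are assumptions about how roles bind to ArkAgents:

```yaml
# Hypothetical ArkTeam — pipeline fields match the earlier example;
# the roles field names are assumptions.
apiVersion: arkonis.dev/v1alpha1
kind: ArkTeam
metadata:
  name: my-team
spec:
  roles:
    - name: research
      agentRef: research-agent      # binds the role to an ArkAgent
    - name: summarize
      agentRef: summarize-agent
  pipeline:
    - role: research
      inputs:
        prompt: "{{ .input.topic }}"
    - role: summarize
      dependsOn: [research]
      inputs:
        content: "{{ .steps.research.output }}"
```

Each trigger against this template produces one immutable ArkRun, the way a CronJob stamps out Jobs.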


What the operator does not do

  • It does not make LLM API calls — agent pods do
  • It does not parse or validate LLM outputs (unless a step has outputSchema)
  • It does not route external traffic — use ArkService for that
  • It does not store agent memory — use ArkMemory for that

Next steps