How It Works
A mental model for understanding ark-operator before diving into the details.
The core idea
Kubernetes knows how to run containers. It has no concept of what runs inside them.
ark-operator adds a layer above that. Instead of managing pods, it manages AI agents — and it understands what agents actually need: which model to call, what instructions to give it, which tools to connect, how many tokens it’s allowed to spend, and whether its output is actually good. These are not things you can express with a Deployment and a readiness probe.
When you define an ArkAgent, you declare:
- What it knows — a system prompt that can live in a ConfigMap and be rolled out like any other config change
- What it can do — MCP tool servers and inline webhook tools connected at startup
- How much it can cost — per-call token limits and rolling 24h budgets enforced before any API call is made
- Whether it’s healthy — a semantic readiness probe that actually calls the LLM and validates the response, not just whether a port is open
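Put together, a minimal ArkAgent might look like the sketch below. The API group matches the `arkonis.dev` label used elsewhere in this doc, but the version and exact field names are illustrative — consult the ArkAgent reference for the real schema:

```yaml
# Illustrative ArkAgent manifest — field names are assumptions,
# mapping one-to-one onto the four concerns listed above.
apiVersion: arkonis.dev/v1alpha1
kind: ArkAgent
metadata:
  name: researcher
spec:
  model: gpt-4o                      # which model to call
  systemPromptFrom:
    configMapKeyRef:                 # what it knows: prompt lives in a ConfigMap
      name: researcher-prompts
      key: system.txt
  tools:
    mcpServers:
      - url: http://search-mcp:8080  # what it can do: MCP server connected at startup
  budget:
    maxTokensPerCall: 8000           # how much it can cost: per-call limit
    rolling24hTokens: 2000000        # rolling 24h budget
  readiness:
    semanticProbe: true              # whether it’s healthy: probe actually calls the LLM
```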
The operator’s job is to make reality match that declaration and keep it there — the same reconciliation loop that manages any other Kubernetes resource.
The operator reconcile loop
The operator watches your ArkAgent resources. When something changes — or on a periodic resync — it runs a reconcile:
Watch ArkAgent CR
│
▼
Reconcile()
├── How many agent pods are currently running?
├── Compare with spec.replicas
├── Scale up → create new agent pods
├── Scale down → delete excess pods
├── Inject env vars: MODEL, SYSTEM_PROMPT, MCP_SERVERS
├── Run semantic liveness checks
└── Update .status.readyReplicas
The operator never modifies your YAML. It only manages the backing Kubernetes resources (Deployments, Pods) that bring the desired state to life.
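Concretely, each reconcile closes the gap between declared and observed state. The two fields named in the diagram above sit on opposite sides of that gap (the surrounding schema is a sketch):

```yaml
spec:
  replicas: 3          # desired state: what you declared
status:
  readyReplicas: 2     # observed state: what the operator last measured
                       # reconcile creates or deletes pods until these converge
```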
What runs inside an agent pod
Each agent pod runs the ark-runtime binary. Its job is deliberately small: poll a task queue, call the LLM, and return results.
agent pod startup
│
├── Read config from env vars (MODEL, SYSTEM_PROMPT, MCP_SERVERS, ...)
├── Connect to MCP tool servers
└── Poll the task queue
│
▼
Task arrives
│
├── Build prompt from system prompt + task input
├── Call LLM provider (tool-use loop until model stops calling tools)
└── Return result to queue
The operator injects all configuration as environment variables. The agent runtime has no knowledge of Kubernetes — it just reads env vars and processes tasks. This means you can run the same runtime code locally with ark run.
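Inside the agent pod's container spec, the injected configuration might look like this (variable names per the startup diagram above; image name and values are illustrative):

```yaml
containers:
  - name: ark-runtime
    image: ark-runtime:latest          # illustrative image name
    env:
      - name: MODEL
        value: gpt-4o
      - name: SYSTEM_PROMPT
        valueFrom:
          configMapKeyRef:             # prompt changes roll out like any config change
            name: researcher-prompts
            key: system.txt
      - name: MCP_SERVERS
        value: http://search-mcp:8080  # comma-separated list is an assumption
```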
How tasks flow
There are two controllers involved in pipeline execution:
- ArkTeam controller — manages infrastructure: creates role ArkAgent pods and ArkService routing, and creates an ArkRun when a trigger fires.
- ArkRun controller — owns pipeline execution: submits step tasks to the queue, polls for results, advances the DAG, and writes final output.
External trigger (webhook / cron / ark trigger)
│
▼
ArkEvent → signals ArkTeam controller
│
▼
ArkTeam controller → creates ArkRun (immutable execution record)
│
▼
ArkRun controller → submits step tasks to the task queue
│
▼
Agent pods poll their queue → call LLM → write result back
│
▼
ArkRun controller reads results → advances DAG → submits next steps
│
▼
All steps done → ArkRun phase = Succeeded → output written to ArkRun.status
ArkTeam mirrors phase in status.lastRunPhase
The task queue is the boundary between the operator and the agent pods. Trace context is propagated across this boundary, so a single OpenTelemetry trace spans the full path from trigger to final output.
When debugging a pipeline, always start with ArkRun — that is where step outputs, token usage, and failure messages live:
kubectl get arkrun -n my-org -l arkonis.dev/team=my-team
kubectl describe arkrun my-team-run-a1b2c3 -n my-org
ArkTeam: the execution primitive
An ArkTeam is the unit of work. It defines roles (agents) and either:
- A pipeline — a DAG of steps executed in declared order (like a CI workflow)
- Dynamic delegation — agents decide at runtime what to delegate to whom (like an org chart)
In pipeline mode, template expressions connect step outputs to the next step’s inputs:
pipeline:
- role: research
inputs:
prompt: "{{ .input.topic }}"
- role: summarize
dependsOn: [research]
inputs:
content: "{{ .steps.research.output }}"
The ArkRun controller tracks each step’s phase (Pending → Running → Succeeded/Failed) and orchestrates the DAG. You never write scheduling logic.
Each pipeline trigger creates a new ArkRun — think of it like a Kubernetes Job. The ArkTeam is the template; the ArkRun is the execution record. You can list past runs with kubectl get arkrun -n <ns>, inspect their step outputs, and see exact token usage.
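An ArkRun's status might record something like the following once it completes. The field names here are an assumption; the phases match those described above:

```yaml
# Illustrative ArkRun status — exact schema may differ.
status:
  phase: Succeeded
  steps:
    research:
      phase: Succeeded
      tokensUsed: 1842       # per-step token usage, as inspected via kubectl describe
    summarize:
      phase: Succeeded
      tokensUsed: 655
  output: |
    ...final pipeline output...
```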
What the operator does not do
- It does not make LLM API calls — agent pods do
- It does not parse or validate LLM outputs (unless a step has outputSchema)
- It does not route external traffic — use ArkService for that
- It does not store agent memory — use ArkMemory for that
Next steps
- ArkAgent — agent spec walkthrough
- ArkTeam — pipeline and dynamic delegation in depth
- Building a Pipeline — step-by-step guide