Scaling

AI agent deployments on Kubernetes can be scaled three ways: manually by setting spec.replicas, through the standard kubectl scale command, or automatically via the daily token budget's scale-to-zero behavior.

Manual scaling

Set spec.replicas directly on an ArkAgent:

spec:
  replicas: 5

The operator reconciles the backing Deployment to match. Apply a patch without editing YAML:

kubectl patch arkagent research-agent -n my-org \
  --type=merge -p '{"spec":{"replicas":10}}'

kubectl scale subresource

ArkAgent supports the Kubernetes scale subresource, so the standard kubectl scale command works:

# Scale up
kubectl scale arkagent research-agent -n my-org --replicas=5

# Drain without deleting (set to 0)
kubectl scale arkagent research-agent -n my-org --replicas=0

This integrates with any tool that uses the scale subresource — Horizontal Pod Autoscalers, GitOps controllers, etc.
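For example, a Horizontal Pod Autoscaler can target an ArkAgent directly through the scale subresource. This is a sketch: the apiVersion ark.example.com/v1 is a placeholder for your actual ArkAgent API group and version, and the CPU metric is illustrative only (CPU is a weak proxy for agent load, as the queue-depth autoscaling section notes).

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: research-agent-hpa
  namespace: my-org
spec:
  scaleTargetRef:
    apiVersion: ark.example.com/v1   # placeholder: use your ArkAgent API group/version
    kind: ArkAgent
    name: research-agent
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```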


Daily token budget: scale-to-zero

spec.limits.maxDailyTokens enforces a rolling 24-hour token cap. When the limit is hit, the operator automatically scales all agent replicas to 0. No manual intervention is required — replicas are restored automatically when the 24-hour window rotates and cumulative usage drops below the limit.

spec:
  limits:
    maxDailyTokens: 500000   # rolling 24h cap across all replicas

This means agents can disappear unexpectedly under normal operation if the budget is consumed. Set the limit to a value you can tolerate being exhausted in a single day, or leave it unset (0) to disable the budget entirely.

Two-layer enforcement

Budget enforcement happens at two points:

1. Agent-side (proactive check): Before each LLM call, the agent queries the task queue backend to sum tokens used in the last 24 hours. If the sum meets or exceeds AGENT_DAILY_TOKEN_LIMIT, the task is rejected immediately — no API call is made and the task is nacked back to the queue. This prevents runaway cost even between operator reconcile cycles.

2. Operator-side (backstop): The ArkAgent reconciler reads .status.dailyTokenUsage and scales replicas to 0 when the daily limit is reached. This is the backstop — it fires on the next reconcile after the agent-side check has already started rejecting tasks.
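The agent-side check can be sketched as follows. This is illustrative, not the actual ArkAgent runtime: the queue interface (sum_tokens_since) and the way the limit is passed in are hypothetical, but the logic matches the documented behavior — reject when the rolling 24-hour sum meets or exceeds the limit, and treat a limit of 0 as disabled.

```python
import time

DAY_SECONDS = 24 * 60 * 60

def within_budget(queue, daily_limit: int, now=None) -> bool:
    """Return True if the rolling 24h token usage is still below the limit.

    A daily_limit of 0 (or negative) means the budget check is disabled.
    `queue.sum_tokens_since(ts)` is a hypothetical backend call that sums
    tokens recorded at or after timestamp `ts`.
    """
    if daily_limit <= 0:
        return True
    now = time.time() if now is None else now
    used = queue.sum_tokens_since(now - DAY_SECONDS)
    return used < daily_limit
```

A task handler would call within_budget before each LLM call and nack the task back to the queue when it returns False, so no API call is ever made over budget.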

When debugging “why are my agents gone?”, check:

kubectl describe arkagent research-agent -n my-org
# Look for condition: BudgetExceeded
# or event: DailyLimitReached

kubectl get arkagent research-agent -n my-org \
  -o jsonpath='{.status.dailyTokenUsage}'

Resource limits (per-agent)

The spec.limits block controls LLM-level resource consumption, not Kubernetes CPU/memory (set those on the backing pod template separately):

spec:
  limits:
    maxTokensPerCall: 8000
    maxConcurrentTasks: 5
    timeoutSeconds: 120
Field               Type   Default   Description
------------------  -----  --------  -----------
maxTokensPerCall    int    8000      Maximum tokens (input + output) per LLM API call.
maxConcurrentTasks  int    5         Maximum tasks a single agent pod processes simultaneously.
timeoutSeconds      int    120       Per-task timeout. The task is abandoned and an error returned after this duration.

These are injected as environment variables (AGENT_MAX_TOKENS, AGENT_TIMEOUT_SECONDS, AGENT_MAX_CONCURRENT_TASKS) into agent pods and enforced by the agent runtime.
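A runtime might read those variables like this. The environment variable names and defaults come from the table above; the loader function itself is a hypothetical sketch, not the actual agent code.

```python
import os

def load_limits(env=None) -> dict:
    """Read the injected per-agent limits, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "max_tokens_per_call": int(env.get("AGENT_MAX_TOKENS", "8000")),
        "max_concurrent_tasks": int(env.get("AGENT_MAX_CONCURRENT_TASKS", "5")),
        "timeout_seconds": int(env.get("AGENT_TIMEOUT_SECONDS", "120")),
    }
```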


Queue-depth autoscaling (planned)

CPU and memory are poor proxies for agent load. What matters is task queue depth — how many tasks are waiting.

Queue-depth-based autoscaling via KEDA is planned. This will let you define scale-up/scale-down triggers on task queue depth so agent pod replicas grow automatically as work arrives and shrink when the queue drains.
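Since KEDA scales anything that exposes the scale subresource, a future setup might resemble the sketch below. Everything here is speculative: the feature is planned, the apiVersion is a placeholder, and the rabbitmq trigger stands in for whatever queue backend your deployment uses.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: research-agent-scaler
  namespace: my-org
spec:
  scaleTargetRef:
    apiVersion: ark.example.com/v1   # placeholder: your ArkAgent API group/version
    kind: ArkAgent
    name: research-agent
  minReplicaCount: 0                 # scale to zero when the queue is empty
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq                 # illustrative: depends on the task queue backend
      metadata:
        queueName: agent-tasks
        queueLength: "20"            # target tasks per replica
```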


See also