Troubleshooting

Diagnose and fix common problems with ark-operator — stuck pipelines, CrashLooping agent pods, queue issues, budget errors, and MCP connection failures.

Start with ArkRun — that is where step outputs, failure messages, and token counts live:

kubectl get arkrun -n <ns> -l arkonis.dev/team=<team>
kubectl describe arkrun <run-name> -n <ns>

Pipeline stuck in Running

Check the ArkRun status:

kubectl get arkrun <run-name> -n <ns> -o jsonpath='{.status.steps}'

Look for steps in Running state with no recent activity. Common causes:

1. Agent pods not ready

kubectl get pods -n <ns> -l arkonis.dev/team=<team>

If pods are Pending or 0/1 Ready, see Agent pod is CrashLooping below.

2. Task not being picked up from the queue

Check whether the task is sitting in Redis:

redis-cli XLEN <namespace>.<team>.<role>

If the queue has tasks even though agent pods are running, check the consumer group:

redis-cli XINFO GROUPS <namespace>.<team>.<role>

If pending is non-zero but consumers is 0, the consumer group holds tasks claimed by a dead consumer. They will be re-delivered once the visibility timeout elapses (default: 30s).
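The re-delivery rule can be sketched as a predicate on an entry's idle time (a sketch of the assumed semantics; idle times are in milliseconds, as XPENDING reports them):

```shell
# Sketch (assumed semantics): a pending entry claimed by a dead consumer becomes
# eligible for re-delivery once its idle time reaches the visibility timeout.
VISIBILITY_TIMEOUT_MS=30000   # operator default: 30s

eligible_for_redelivery() {
  local idle_ms=$1
  [ "$idle_ms" -ge "$VISIBILITY_TIMEOUT_MS" ]
}

eligible_for_redelivery 45000 && echo "will be re-delivered"
eligible_for_redelivery 5000 || echo "still claimed"
```

To reclaim stale entries immediately instead of waiting, Redis 6.2+ offers XAUTOCLAIM, e.g. redis-cli XAUTOCLAIM <namespace>.<team>.<role> <consumer-group> rescue-consumer 0 0.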

3. Step waiting on a dependency that failed

A dependsOn reference to a failed step will block all downstream steps indefinitely. Check all step phases — a failed step upstream is the usual cause. Use ark retry to reset failed steps:

ark retry <team> -n <ns>

4. Daily token budget exhausted

If spec.limits.maxDailyTokens is set and the rolling 24h budget is exceeded, replicas are scaled to zero. Check:

kubectl get arkagent -n <ns>
# Look for replicas=0 unexpectedly
kubectl describe arkagent <agent> -n <ns>
# Look for BudgetExceeded condition

See BudgetExceeded below.


Agent pod is CrashLooping

kubectl logs <pod-name> -n <ns> --previous

Common causes:

| Symptom in logs | Cause | Fix |
| --- | --- | --- |
| OPENAI_API_KEY not set or ANTHROPIC_API_KEY not set | API key Secret not found or not mounted | Verify apiKeys.existingSecret references a Secret that exists in the operator namespace |
| dial tcp: connection refused on Redis URL | Redis not running or wrong URL | Check taskQueueURL and kubectl get pods -n ark-system |
| model not found from provider | Wrong model name or wrong provider | Check AGENT_MODEL and AGENT_PROVIDER env vars on the pod |
| context deadline exceeded at startup | MCP server unreachable | See MCP server connection failures |
| OOMKilled | Memory limit too low for large prompts | Increase agentResources.limits.memory in Helm values |

Tasks not being picked up from the queue

Verify the agent is connected to the right queue:

kubectl exec <pod-name> -n <ns> -- env | grep TASK_QUEUE_URL

The URL should match your Redis endpoint. If it’s empty, the operator hasn’t injected config — check that the ArkTeam and ArkAgent are in the same namespace and the operator is running.
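A quick sanity check on the value can be scripted. This assumes the queue URL uses the redis:// scheme (consistent with the redis-cli examples in this guide); the helper name is made up:

```shell
# Hypothetical helper: sanity-check a task queue URL.
valid_queue_url() {
  case "$1" in
    redis://*) return 0 ;;   # looks like a Redis endpoint
    *)         return 1 ;;   # empty, or an unexpected scheme
  esac
}

valid_queue_url "redis://redis.ark-system.svc:6379" && echo "URL looks sane"
valid_queue_url "" || echo "empty: operator has not injected config"
```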

Check consumer group membership:

redis-cli XINFO CONSUMERS <namespace>.<team>.<role> <consumer-group>

If no consumers appear, the agent pods haven’t successfully connected. Check pod logs for connection errors.

Check the queue key name:

Queue keys follow the format <namespace>.<team>.<role>. A typo in the role name in the ArkTeam spec will result in tasks going to a different key than the agents are watching.
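Comparing the key the producer writes to against the key the consumers watch catches such typos. A minimal sketch (queue_key is a hypothetical helper; the format comes from this section):

```shell
# Build a queue key from its three components: <namespace>.<team>.<role>.
queue_key() {
  printf '%s.%s.%s' "$1" "$2" "$3"
}

producer_key=$(queue_key prod research writer)
consumer_key=$(queue_key prod research writter)   # note the typo in the role

if [ "$producer_key" != "$consumer_key" ]; then
  echo "key mismatch: $producer_key vs $consumer_key"
fi
```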


BudgetExceeded condition

When the rolling 24-hour token budget (spec.limits.maxDailyTokens) is exhausted, the operator scales replicas to zero and sets a BudgetExceeded condition on the ArkAgent.

Inspect:

kubectl describe arkagent <agent> -n <ns>
# Conditions:
#   BudgetExceeded: True — rolling 24h budget of 500000 tokens exhausted

Wait for the window to rotate: the agent automatically resumes when the oldest token records in the rolling 24h window expire. No manual action is required unless you want to override.
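The rolling-window behavior can be illustrated with a small awk sketch. The numbers and record format are invented, and the operator's internal bookkeeping may differ:

```shell
# Illustrative sketch of rolling 24h budget accounting (invented numbers).
now=100000          # "current" unix time, fixed for the example
window=86400        # 24 hours in seconds
max_daily=500000    # spec.limits.maxDailyTokens

# "<timestamp> <tokens>" records; the first is older than 24h, so it has expired.
records="10000 300000
90000 200000
95000 350000"

spent=$(printf '%s\n' "$records" | awk -v now="$now" -v w="$window" \
  '$1 > now - w { sum += $2 } END { print sum + 0 }')

echo "spent in window: $spent / $max_daily"
if [ "$spent" -ge "$max_daily" ]; then echo "BudgetExceeded"; fi
```

As the oldest in-window record (timestamp 90000) ages out, spent falls to 350000, below the budget, which is why the agent resumes without manual action.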

Override the budget temporarily:

kubectl patch arkagent <agent> -n <ns> \
  --type=merge -p '{"spec":{"limits":{"maxDailyTokens":0}}}'

Setting maxDailyTokens: 0 disables the budget until you set it back.


MCP server connection failures at startup

Agent pods attempt to connect to all configured MCP servers at startup. A failure causes the pod to log a warning but not crash — the agent starts without that tool server.

kubectl logs <pod-name> -n <ns> | grep -i mcp
# WARN  mcp: failed to connect to web-search: dial tcp: connection refused

The agent will operate with the remaining tools. If a required tool is unavailable, tasks that invoke it will fail.

Diagnose connectivity:

kubectl exec <pod-name> -n <ns> -- wget -qO- <mcp-server-url>/health

If the MCP server requires auth headers, verify the Secret referenced in spec.mcpServers[].headers[].secretKeyRef exists and has the correct key:

kubectl get secret <secret-name> -n <ns> -o jsonpath='{.data.<key>}' | base64 -d

Step output is empty or malformed

If a step completes (Succeeded) but the downstream template {{ .steps.<name>.output }} resolves to empty:

  1. Check the step output directly: kubectl get arkrun <run> -n <ns> -o jsonpath='{.status.steps}'
  2. Verify the outputSchema — if the step has a schema and the LLM returned non-JSON, the output is dropped
  3. Run locally with --watch to see the raw LLM response:
    ark run team.yaml --provider openai --watch
    
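The schema gate in step 2 can be reproduced locally. This sketch assumes only that the output must parse as JSON (the real schema check may be stricter) and shells out to python3 for the parsing:

```shell
# Hypothetical check mirroring the outputSchema gate: non-JSON output is dropped.
is_json() {
  printf '%s' "$1" | python3 -c 'import json, sys; json.load(sys.stdin)' 2>/dev/null
}

is_json '{"summary": "ok"}' && echo "output kept"
is_json 'Sure! Here is the summary:' || echo "output dropped"
```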

Operator not reconciling after a change

If you updated an ArkTeam or ArkAgent and the operator hasn’t reacted:

kubectl rollout status deployment/ark-operator -n ark-system
kubectl logs deployment/ark-operator -n ark-system | tail -50

Look for errors like failed to get resource or reconcile error. A common cause is a missing RBAC permission after a CRD upgrade — reapply the RBAC manifests:

kubectl apply -f https://raw.githubusercontent.com/arkonis-dev/ark-operator/v0.11.1/config/rbac/

See also