Troubleshooting
Start with ArkRun — that is where step outputs, failure messages, and token counts live:
kubectl get arkrun -n <ns> -l arkonis.dev/team=<team>
kubectl describe arkrun <run-name> -n <ns>
Pipeline stuck in Running
Check the ArkRun status:
kubectl get arkrun <run-name> -n <ns> -o jsonpath='{.status.steps}'
Look for steps in Running state with no recent activity. Common causes:
1. Agent pods not ready
kubectl get pods -n <ns> -l arkonis.dev/team=<team>
If pods are Pending or 0/1 Ready, see Agent pod is CrashLooping below.
2. Task not being picked up from the queue
Check whether the task is sitting in Redis:
redis-cli XLEN <namespace>.<team>.<role>
If the queue has tasks but pods are running, check the consumer group:
redis-cli XINFO GROUPS <namespace>.<team>.<role>
If the pending count is non-zero but there are no consumers, the group holds tasks claimed by a dead consumer. They will be re-delivered after the visibility timeout (default: 30s).
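To see which consumer is holding the stuck deliveries, or to reclaim them immediately instead of waiting out the visibility timeout, the standard Redis stream-group commands apply (stream, group, and consumer names below are placeholders; XAUTOCLAIM needs Redis 6.2+):

```shell
# Show up to 10 pending entries with their owning consumer, idle time,
# and delivery count
redis-cli XPENDING <namespace>.<team>.<role> <consumer-group> - + 10

# Reassign entries idle longer than the 30s visibility timeout (30000 ms)
# to a live consumer, so they are processed without waiting for re-delivery
redis-cli XAUTOCLAIM <namespace>.<team>.<role> <consumer-group> <live-consumer> 30000 0
```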
3. Step waiting on a dependency that failed
A dependsOn reference to a failed step will block all downstream steps indefinitely. Check all step phases — a failed step upstream is the usual cause. Use ark retry to reset failed steps:
ark retry <team> -n <ns>
4. Daily token budget exhausted
If spec.limits.maxDailyTokens is set and the rolling 24h budget is exceeded, replicas are scaled to zero. Check:
kubectl get arkagent -n <ns>
# Look for replicas=0 unexpectedly
kubectl describe arkagent <agent> -n <ns>
# Look for BudgetExceeded condition
See BudgetExceeded below.
Agent pod is CrashLooping
kubectl logs <pod-name> -n <ns> --previous
Common causes:
| Symptom in logs | Cause | Fix |
|---|---|---|
| OPENAI_API_KEY not set or ANTHROPIC_API_KEY not set | API key Secret not found or not mounted | Verify apiKeys.existingSecret references a Secret that exists in the operator namespace |
| dial tcp: connection refused on Redis URL | Redis not running or wrong URL | Check taskQueueURL and kubectl get pods -n ark-system |
| model not found from provider | Wrong model name or wrong provider | Check AGENT_MODEL and AGENT_PROVIDER env vars on the pod |
| context deadline exceeded at startup | MCP server unreachable | See MCP server connection failures |
| OOMKilled | Memory limit too low for large prompts | Increase agentResources.limits.memory in Helm values |
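For the OOMKilled case specifically, the pod status is more reliable than logs, since the kill happens outside the process. A quick check with standard kubectl (assumes a single-container pod):

```shell
# Prints "OOMKilled" if the last termination was a memory kill
kubectl get pod <pod-name> -n <ns> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```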
Tasks not being picked up from the queue
Verify the agent is connected to the right queue:
kubectl exec <pod-name> -n <ns> -- env | grep TASK_QUEUE_URL
The URL should match your Redis endpoint. If it’s empty, the operator hasn’t injected config — check that the ArkTeam and ArkAgent are in the same namespace and the operator is running.
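A quick way to verify both conditions (the operator pod label here is an assumption; match it to your install):

```shell
# ArkTeam and ArkAgent must be listed in the same namespace
kubectl get arkteam,arkagent -n <ns>

# The operator must be running; "app=ark-operator" is an assumed label
kubectl get pods -n ark-system -l app=ark-operator
```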
Check consumer group membership:
redis-cli XINFO CONSUMERS <namespace>.<team>.<role> <consumer-group>
If no consumers appear, the agent pods haven’t successfully connected. Check pod logs for connection errors.
Check the queue key name:
Queue keys follow the format <namespace>.<team>.<role>. A typo in the role name in the ArkTeam spec will result in tasks going to a different key than the agents are watching.
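As an illustration of how a single-character role typo splits producer and consumer onto different keys (all names here are synthetic):

```shell
ns=prod
team=research
role_in_spec="writter"   # typo in the ArkTeam spec
role_in_agent="writer"   # role the agent pods were deployed with

producer_key="${ns}.${team}.${role_in_spec}"
consumer_key="${ns}.${team}.${role_in_agent}"

echo "producer publishes to: $producer_key"
echo "agents consume from:   $consumer_key"
if [ "$producer_key" != "$consumer_key" ]; then
  echo "MISMATCH: tasks will queue up unread"
fi
```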
BudgetExceeded condition
When the rolling 24-hour token budget (spec.limits.maxDailyTokens) is exhausted, the operator scales replicas to zero and sets a BudgetExceeded condition on the ArkAgent.
Inspect:
kubectl describe arkagent <agent> -n <ns>
# Conditions:
# BudgetExceeded: True — rolling 24h budget of 500000 tokens exhausted
Wait for the window to rotate: the agent automatically resumes when the oldest token records in the rolling 24h window expire. No manual action is required unless you want to override.
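To build intuition for when the window rotates, here is a toy model of the rolling sum: only token records younger than 24h (86400s) count against the budget. The usage records are synthetic; this is not the operator's actual bookkeeping:

```shell
now=$(date +%s)

# Synthetic usage log: "<unix-timestamp> <tokens>" per line;
# the first record is 25h old and has already rotated out of the window
usage_log="$(printf '%s 200000\n%s 250000\n%s 100000\n' \
  $((now - 90000)) $((now - 3600)) $((now - 60)))"

# Sum only records inside the rolling 24h window
spent=$(echo "$usage_log" | awk -v now="$now" \
  '$1 > now - 86400 { sum += $2 } END { print sum+0 }')
budget=500000

echo "spent=$spent budget=$budget"
if [ "$spent" -ge "$budget" ]; then
  echo "BudgetExceeded"
else
  echo "within budget"
fi
```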
Override the budget temporarily:
kubectl patch arkagent <agent> -n <ns> \
--type=merge -p '{"spec":{"limits":{"maxDailyTokens":0}}}'
Setting maxDailyTokens: 0 disables the budget until you set it back.
MCP server connection failures at startup
Agent pods attempt to connect to all configured MCP servers at startup. A failure causes the pod to log a warning but not crash — the agent starts without that tool server.
kubectl logs <pod-name> -n <ns> | grep -i mcp
# WARN mcp: failed to connect to web-search: dial tcp: connection refused
The agent will operate with the remaining tools. If a required tool is unavailable, tasks that invoke it will fail.
Diagnose connectivity:
kubectl exec <pod-name> -n <ns> -- wget -qO- <mcp-server-url>/health
If the MCP server requires auth headers, verify the Secret referenced in spec.mcpServers[].headers[].secretKeyRef exists and has the correct key:
kubectl get secret <secret-name> -n <ns> -o jsonpath='{.data.<key>}' | base64 -d
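To reproduce what the agent does at startup when the server requires auth, attach the header manually. The header name and Bearer scheme below are assumptions; match them to your spec.mcpServers[].headers entry:

```shell
# Pull the token out of the Secret (the key name is a placeholder)
TOKEN=$(kubectl get secret <secret-name> -n <ns> \
  -o jsonpath='{.data.<key>}' | base64 -d)

# Call the health endpoint with the header attached, from inside the pod
kubectl exec <pod-name> -n <ns> -- \
  wget -qO- --header "Authorization: Bearer $TOKEN" <mcp-server-url>/health
```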
Step output is empty or malformed
If a step completes (Succeeded) but the downstream template {{ .steps.<name>.output }} resolves to empty:
- Check the step output directly:
kubectl get arkrun <run> -n <ns> -o jsonpath='{.status.steps}'
- Verify the outputSchema: if the step has a schema and the LLM returned non-JSON, the output is dropped
- Run locally with --watch to see the raw LLM response: ark run team.yaml --provider openai --watch
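When an outputSchema is in play, the first thing to rule out is that the model returned something that is not JSON at all. A quick local check (the output string is synthetic; python3 -m json.tool is used only as a JSON parser):

```shell
# Truncated JSON, as an LLM might emit when it hits a token limit
output='{"summary": "done", "items": ["a", "b"'

if echo "$output" | python3 -m json.tool >/dev/null 2>&1; then
  echo "valid JSON"
else
  echo "invalid JSON: a schema-checked step would drop this output"
fi
```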
Operator not reconciling after a change
If you updated an ArkTeam or ArkAgent and the operator hasn’t reacted:
kubectl rollout status deployment/ark-operator -n ark-system
kubectl logs deployment/ark-operator -n ark-system | tail -50
Look for errors like failed to get resource or reconcile error. A common cause is a missing RBAC permission after a CRD upgrade — reapply the RBAC manifests:
kubectl apply -f https://raw.githubusercontent.com/arkonis-dev/ark-operator/v0.11.1/config/rbac/
See also
- Upgrade Guide — CRD migrations and breaking changes between versions
- Observability — trace a run end-to-end with OTel
- CLI: ark run — full flags reference including retry