Troubleshooting

Diagnose and fix common problems with ark-operator — stuck pipelines, CrashLooping agent pods, queue issues, budget errors, and MCP connection failures.

Start with ArkRun — that is where step outputs, failure messages, and token counts live:

kubectl get arkrun -n <ns> -l arkonis.dev/team=<team>
kubectl describe arkrun <run-name> -n <ns>

Pipeline stuck in Running

Check the ArkRun status:

kubectl get arkrun <run-name> -n <ns> -o jsonpath='{.status.steps}'

Look for steps in Running state with no recent activity. Common causes:

1. Agent pods not ready

kubectl get pods -n <ns> -l arkonis.dev/team=<team>

If pods are Pending or 0/1 Ready, see Agent pod is CrashLooping below.

2. Task not being picked up from the queue

Check whether the task is sitting in Redis:

redis-cli XLEN <namespace>.<team>.<role>

If the queue has tasks even though agent pods are running, check the consumer group:

redis-cli XINFO GROUPS <namespace>.<team>.<role>

If pending is non-zero but consumers is 0, the consumer group holds tasks claimed by a dead consumer. They will be re-delivered once the visibility timeout elapses (default: 30s).
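The re-delivery rule can be sketched as a predicate on an entry's idle time (a sketch of the assumed semantics; idle times are in milliseconds, as XPENDING reports them):

```shell
# Sketch (assumed semantics): a pending entry claimed by a dead consumer becomes
# eligible for re-delivery once its idle time reaches the visibility timeout.
VISIBILITY_TIMEOUT_MS=30000   # operator default: 30s

eligible_for_redelivery() {
  local idle_ms=$1
  [ "$idle_ms" -ge "$VISIBILITY_TIMEOUT_MS" ]
}

eligible_for_redelivery 45000 && echo "will be re-delivered"
eligible_for_redelivery 5000 || echo "still claimed"
```

To reclaim stale entries immediately instead of waiting, Redis 6.2+ offers XAUTOCLAIM, e.g. redis-cli XAUTOCLAIM <namespace>.<team>.<role> <consumer-group> rescue-consumer 0 0.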

3. Step waiting on a dependency that failed

A dependsOn reference to a failed step will block all downstream steps indefinitely. Check all step phases — a failed step upstream is the usual cause. Use ark retry to reset failed steps:

ark retry <team> -n <ns>

4. Daily token budget exhausted

If spec.limits.maxDailyTokens is set and the rolling 24h budget is exceeded, replicas are scaled to zero. Check:

kubectl get arkagent -n <ns>
# Look for replicas=0 unexpectedly
kubectl describe arkagent <agent> -n <ns>
# Look for BudgetExceeded condition

See BudgetExceeded below.


Agent pod is CrashLooping

kubectl logs <pod-name> -n <ns> --previous

Common causes:

| Symptom in logs | Cause | Fix |
| --- | --- | --- |
| OPENAI_API_KEY not set or ANTHROPIC_API_KEY not set | API key Secret not found or not mounted | Verify apiKeys.existingSecret references a Secret that exists in the operator namespace |
| dial tcp: connection refused on Redis URL | Redis not running or wrong URL | Check taskQueueURL and kubectl get pods -n ark-system |
| model not found from provider | Wrong model name or wrong provider | Check AGENT_MODEL and AGENT_PROVIDER env vars on the pod |
| context deadline exceeded at startup | MCP server unreachable | See MCP server connection failures |
| OOMKilled | Memory limit too low for large prompts | Increase agentResources.limits.memory in Helm values |

Tasks not being picked up from the queue

Verify the agent is connected to the right queue:

kubectl exec <pod-name> -n <ns> -- env | grep TASK_QUEUE_URL

The URL should match your Redis endpoint. If it’s empty, the operator hasn’t injected config — check that the ArkTeam and ArkAgent are in the same namespace and the operator is running.
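A quick sanity check on the value can be scripted. This assumes the queue URL uses the redis:// scheme (consistent with the redis-cli examples in this guide); the helper name is made up:

```shell
# Hypothetical helper: sanity-check a task queue URL.
valid_queue_url() {
  case "$1" in
    redis://*) return 0 ;;   # looks like a Redis endpoint
    *)         return 1 ;;   # empty, or an unexpected scheme
  esac
}

valid_queue_url "redis://redis.ark-system.svc:6379" && echo "URL looks sane"
valid_queue_url "" || echo "empty: operator has not injected config"
```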

Check consumer group membership:

redis-cli XINFO CONSUMERS <namespace>.<team>.<role> <consumer-group>

If no consumers appear, the agent pods haven’t successfully connected. Check pod logs for connection errors.

Check the queue key name:

Queue keys follow the format <namespace>.<team>.<role>. A typo in the role name in the ArkTeam spec will result in tasks going to a different key than the agents are watching.
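Comparing the key the producer writes to against the key the consumers watch catches such typos. A minimal sketch (queue_key is a hypothetical helper; the format comes from this section):

```shell
# Build a queue key from its three components: <namespace>.<team>.<role>.
queue_key() {
  printf '%s.%s.%s' "$1" "$2" "$3"
}

producer_key=$(queue_key prod research writer)
consumer_key=$(queue_key prod research writter)   # note the typo in the role

if [ "$producer_key" != "$consumer_key" ]; then
  echo "key mismatch: $producer_key vs $consumer_key"
fi
```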


BudgetExceeded condition

When the rolling 24-hour token budget (spec.limits.maxDailyTokens) is exhausted, the operator scales replicas to zero and sets a BudgetExceeded condition on the ArkAgent.

Inspect:

kubectl describe arkagent <agent> -n <ns>
# Conditions:
#   BudgetExceeded: True — rolling 24h budget of 500000 tokens exhausted

Wait for the window to rotate: the agent automatically resumes when the oldest token records in the rolling 24h window expire. No manual action is required unless you want to override.
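The rolling-window behavior can be illustrated with a small awk sketch. The numbers and record format are invented, and the operator's internal bookkeeping may differ:

```shell
# Illustrative sketch of rolling 24h budget accounting (invented numbers).
now=100000          # "current" unix time, fixed for the example
window=86400        # 24 hours in seconds
max_daily=500000    # spec.limits.maxDailyTokens

# "<timestamp> <tokens>" records; the first is older than 24h, so it has expired.
records="10000 300000
90000 200000
95000 350000"

spent=$(printf '%s\n' "$records" | awk -v now="$now" -v w="$window" \
  '$1 > now - w { sum += $2 } END { print sum + 0 }')

echo "spent in window: $spent / $max_daily"
if [ "$spent" -ge "$max_daily" ]; then echo "BudgetExceeded"; fi
```

As the oldest in-window record (timestamp 90000) ages out, spent falls to 350000, below the budget, which is why the agent resumes without manual action.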

Override the budget temporarily:

kubectl patch arkagent <agent> -n <ns> \
  --type=merge -p '{"spec":{"limits":{"maxDailyTokens":0}}}'

Setting maxDailyTokens: 0 disables the budget until you set it back.


MCP server connection failures at startup

Agent pods attempt to connect to all configured MCP servers at startup. A failure causes the pod to log a warning but not crash — the agent starts without that tool server.

kubectl logs <pod-name> -n <ns> | grep -i mcp
# WARN  mcp: failed to connect to web-search: dial tcp: connection refused

The agent will operate with the remaining tools. If a required tool is unavailable, tasks that invoke it will fail.

Diagnose connectivity:

kubectl exec <pod-name> -n <ns> -- wget -qO- <mcp-server-url>/health

If the MCP server requires auth headers, verify the Secret referenced in spec.mcpServers[].headers[].secretKeyRef exists and has the correct key:

kubectl get secret <secret-name> -n <ns> -o jsonpath='{.data.<key>}' | base64 -d

Step output is empty or malformed

If a step completes (Succeeded) but the downstream template {{ .steps.<name>.output }} resolves to empty:

  1. Check the step output directly: kubectl get arkrun <run> -n <ns> -o jsonpath='{.status.steps}'
  2. Verify the outputSchema — if the step has a schema and the LLM returned non-JSON, the output is dropped
  3. Run locally with --watch to see the raw LLM response:
    ark run team.yaml --provider openai --watch
    
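The schema gate in step 2 can be reproduced locally. This sketch assumes only that the output must parse as JSON (the real schema check may be stricter) and shells out to python3 for the parsing:

```shell
# Hypothetical check mirroring the outputSchema gate: non-JSON output is dropped.
is_json() {
  printf '%s' "$1" | python3 -c 'import json, sys; json.load(sys.stdin)' 2>/dev/null
}

is_json '{"summary": "ok"}' && echo "output kept"
is_json 'Sure! Here is the summary:' || echo "output dropped"
```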

Operator not reconciling after a change

If you updated an ArkTeam or ArkAgent and the operator hasn’t reacted:

kubectl rollout status deployment/ark-operator -n ark-system
kubectl logs deployment/ark-operator -n ark-system | tail -50

Look for errors like failed to get resource or reconcile error. A common cause is a missing RBAC permission after a CRD upgrade — reapply the RBAC manifests:

kubectl apply -f https://raw.githubusercontent.com/arkonis-dev/ark-operator/v0.11.1/config/rbac/

See also