# Observability
ark-operator emits OpenTelemetry traces and metrics from both the operator and every agent pod. Because it uses the standard OTLP exporter, telemetry routes to any compatible backend — Jaeger, Tempo, Prometheus, Datadog, Grafana Cloud, Honeycomb, New Relic — without changing any operator code.
## Metrics

### Agent runtime metrics

Emitted by each agent pod during task processing.
| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `arkonis.task.started` | Counter | — | namespace, agent, role | Tasks pulled from the queue. |
| `arkonis.task.completed` | Counter | — | namespace, agent, role | Tasks completed successfully. |
| `arkonis.task.failed` | Counter | — | namespace, agent, role | Tasks that errored or exhausted retries. |
| `arkonis.task.duration` | Histogram | ms | namespace, agent, role | End-to-end task wall time from poll to ack. |
| `arkonis.task.queue_wait` | Histogram | ms | namespace, agent, role | Time from task enqueue to agent poll (scheduling latency). |
| `arkonis.llm.call.duration` | Histogram | ms | namespace, agent, provider, model | Single LLM API round-trip (one iteration of the tool-use loop). |
| `arkonis.llm.tokens.input` | Counter | — | namespace, agent, provider, model | Input tokens consumed. Use this for cost attribution. |
| `arkonis.llm.tokens.output` | Counter | — | namespace, agent, provider, model | Output tokens produced. |
| `arkonis.tool.call.duration` | Histogram | ms | namespace, agent, tool_name | Single tool invocation time (MCP or webhook). |
| `arkonis.tool.call.errors` | Counter | — | namespace, agent, tool_name | Tool invocations that returned an error. |
| `arkonis.delegate.submitted` | Counter | — | namespace, agent, from_role, to_role | Tasks submitted via the `delegate()` built-in tool. |
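The token counters lend themselves to a quick back-of-the-envelope cost estimate. A minimal sketch, where the model name and per-token prices are illustrative placeholders (real rates vary by provider and model), fed with deltas read from `arkonis.llm.tokens.input` / `arkonis.llm.tokens.output`:

```python
# Rough cost attribution from the token counters.
# Model name and prices are illustrative placeholders, not real provider rates.
PRICE_PER_1K = {
    "gpt-4o": (0.005, 0.015),  # (input $/1k tokens, output $/1k tokens) — assumed
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar estimate from token-counter deltas for one model."""
    in_rate, out_rate = PRICE_PER_1K[model]
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate

print(round(estimate_cost("gpt-4o", 1204, 312), 4))  # → 0.0107
```

In practice you would run this over per-model counter deltas pulled from your metrics backend rather than hard-coded numbers.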
### Operator metrics

Emitted by the operator reconcile loops.
| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `arkonis.reconcile.duration` | Histogram | ms | controller, namespace, result | Reconcile loop latency. `result` is `ok` or `error`. |
| `arkonis.reconcile.errors` | Counter | — | controller, namespace | Reconcile loops that returned an error. |
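The error counter is a natural alerting signal. A sketch of a Prometheus alerting rule on sustained reconcile errors — the threshold and durations are illustrative, and it assumes metrics are flowing into Prometheus as described under "Connecting to a backend":

```yaml
# Illustrative Prometheus alerting rule — tune the threshold for your cluster.
groups:
  - name: ark-operator
    rules:
      - alert: ArkReconcileErrors
        expr: rate(arkonis_reconcile_errors_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ark-operator reconcile loop erroring in {{ $labels.namespace }}"
```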
## Traces (per agent task)

Spans are linked via W3C TraceContext propagated through task queue metadata, so a single trace spans operator → queue → agent pod.
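The propagation itself is plain W3C TraceContext: a `traceparent` string carried alongside the task payload, which the agent pod parses to start its span as a child of the operator's. A minimal sketch of forming and parsing that header — the `task_metadata` field name is illustrative, not ark-operator's actual queue schema:

```python
import re

# W3C TraceContext "traceparent": version-traceid-spanid-flags
def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        raise ValueError(f"malformed traceparent: {header}")
    return {"trace_id": m.group(1), "span_id": m.group(2), "sampled": m.group(3) == "01"}

# The producer attaches the header to queue metadata; the consumer parses it
# and continues the same trace on the other side of the queue.
task_metadata = {"traceparent": make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")}
ctx = parse_traceparent(task_metadata["traceparent"])
print(ctx["trace_id"])  # → 4bf92f3577b34da6a3ce929d0e0e4736
```

With OpenTelemetry SDKs this round-trip is normally handled by the W3C propagator rather than written by hand.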
| Span | Emitted by | Key attributes |
|---|---|---|
| `arkonis.task` | Agent runtime | `agent.name`, `task.id`, `task.prompt_len` |
| `arkonis.llm.call` | LLM provider | `llm.model`, `llm.provider`, `llm.input_tokens`, `llm.output_tokens` |
| `arkonis.tool.call` | Runner (per tool use) | `tool.name`, `tool.type` |
| `arkonis.delegate` | Runner | `from_role`, `to_role` |
| `arkonis.reconcile` | Controller | `controller`, `namespace`, `name` |
## Kubernetes Events (audit log)

Every agent action emits a structured Kubernetes Event visible with `kubectl`:
| Reason | Emitted on | Description |
|---|---|---|
| `TaskStarted` | ArkAgent | Agent pod picks up a task from the queue. |
| `TaskCompleted` | ArkAgent | Task finishes successfully. |
| `TaskFailed` | ArkAgent | Task fails or exceeds timeout. |
| `TaskDelegated` | ArkAgent | Agent delegates a sub-task to a team role. |
```shell
# All ArkAgent events in the namespace
kubectl get events -n ai-team --field-selector involvedObject.kind=ArkAgent

# Only delegation events
kubectl get events -n ai-team --field-selector reason=TaskDelegated
```
## Connecting to a backend

Enable telemetry by setting `OTEL_EXPORTER_OTLP_ENDPOINT` on the operator and agent pods. All other configuration uses standard OTel SDK environment variables — no ark-operator-specific settings.

### Jaeger (all-in-one for development)
```yaml
# Deploy Jaeger all-in-one
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
          ports:
            - containerPort: 16686 # UI
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: monitoring
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318
```
Point the operator at Jaeger via Helm:

```shell
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://jaeger.monitoring.svc.cluster.local:4317
```

Open the UI:

```shell
kubectl port-forward svc/jaeger 16686:16686 -n monitoring
open http://localhost:16686
```

Select service `ark-agent`, click Find Traces, then click any trace to see the full waterfall.
### Prometheus (via OpenTelemetry Collector)

Metrics are exported via OTLP. Prometheus scrapes them through an OpenTelemetry Collector configured with a `prometheus` exporter.

1. Deploy the OTel Collector:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [prometheus]
        # A pipeline must declare at least one exporter, so forward traces
        # only once you have a trace backend to send them to, e.g.:
        # traces:
        #   receivers: [otlp]
        #   exporters: [otlp]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel/config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
    - name: metrics
      port: 8889
```
2. Configure ark-operator to send to the collector:

```shell
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://otel-collector.monitoring.svc.cluster.local:4317
```
3. Add a Prometheus scrape job:

```yaml
# prometheus.yml scrape config
scrape_configs:
  - job_name: ark-operator
    static_configs:
      - targets: ['otel-collector.monitoring.svc.cluster.local:8889']
```
Example PromQL queries:

```promql
# Task throughput (tasks per second, averaged over 1m)
rate(arkonis_task_completed_total[1m])

# Failed task rate
rate(arkonis_task_failed_total[1m])

# P99 task duration by agent
histogram_quantile(0.99, sum by (agent, le) (rate(arkonis_task_duration_milliseconds_bucket[5m])))

# Input tokens by model over the last hour (for cost attribution)
sum by (model) (increase(arkonis_llm_tokens_input_total[1h]))

# Operator reconcile error rate
rate(arkonis_reconcile_errors_total[5m])
```
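The P99 query relies on `histogram_quantile`, which linearly interpolates within cumulative `le` buckets. A minimal Python sketch of that estimate, handy for sanity-checking dashboard numbers — the bucket bounds and counts below are made up:

```python
def histogram_quantile(q, buckets):
    """Approximate Prometheus histogram_quantile over cumulative (le, count) buckets."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket, as Prometheus does.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# Cumulative task-duration buckets in ms: 40 obs ≤ 500ms, 90 ≤ 1000ms, 100 ≤ 5000ms
print(histogram_quantile(0.99, [(500, 40), (1000, 90), (5000, 100)]))  # → 4600.0
```

The estimate lands inside the last bucket because only 10 of 100 observations exceed 1000 ms; coarse buckets at the tail make high quantiles correspondingly coarse.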
### Grafana Cloud / Datadog / other OTLP backends

Set the endpoint and any required authentication headers via standard OTel env vars:

```shell
# Grafana Cloud
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=https://otlp-gateway-prod-us-east-0.grafana.net/otlp \
  --set otel.headers="Authorization=Basic <base64-encoded-credentials>"

# Datadog
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=https://otlp.datadoghq.com \
  --set otel.headers="DD-API-KEY=<your-api-key>"
```

These map to the standard `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` env vars injected into both the operator and agent pods.
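Concretely, those Helm values should surface in the rendered pod spec roughly as follows — an illustrative rendering, not the chart's exact output:

```yaml
# What the otel.* Helm values translate to on the operator and agent pods
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: https://otlp.datadoghq.com
  - name: OTEL_EXPORTER_OTLP_HEADERS
    value: DD-API-KEY=<your-api-key>
```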
## `ark run --trace` (local, no backend)

When running flows locally, `--trace` collects spans in-memory and prints a tree after the flow completes — no backend required:

```shell
ark run quickstart.yaml --trace
```

```
Flow succeeded in 9.1s — 1,516 tokens

arkonis.task [9.1s]
├─ arkonis.llm.call [4.2s] research in=1,204 out=312
└─ arkonis.llm.call [2.1s] summarize in=312 out=88
```
## `ark trace` (remote lookup)

Look up a specific task by ID in a running Jaeger or Tempo instance:

```shell
# Jaeger
ark trace <task-id> --endpoint http://localhost:16686

# Tempo
ark trace <task-id> --endpoint http://tempo.monitoring.svc.cluster.local:3100
```
Find task IDs from an ArkRun:

```shell
kubectl get arkrun <run-name> -n ai-team \
  -o jsonpath='{.status.steps[*].taskID}'
```
## Environment variables

All variables follow the OpenTelemetry specification and apply to both the operator pod and all agent pods.
| Variable | Description |
|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTel collector endpoint. Enables tracing and metrics when set. Example: `http://jaeger:4317` |
| `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated `key=value` headers for the OTLP exporter. Use for auth tokens. |
| `OTEL_SERVICE_NAME` | Service name in traces. Defaults to `ark-agent` for agent pods, `ark-operator` for the operator. |
| `OTEL_TRACES_SAMPLER` | Sampling strategy. Default: `parentbased_always_on`. Set to `traceidratio` with `OTEL_TRACES_SAMPLER_ARG=0.1` for 10% sampling in high-volume clusters. |
| `OTEL_EXPORTER_OTLP_PROTOCOL` | Transport protocol: `grpc` (default) or `http/protobuf`. |
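For example, 10% head sampling in a high-volume cluster is just the two standard sampler variables set on the pods — a sketch of the env block, assuming you manage pod env directly rather than through Helm values:

```yaml
# 10% trace sampling — standard OTel SDK env vars, no ark-operator-specific keys
env:
  - name: OTEL_TRACES_SAMPLER
    value: traceidratio
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"
```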
## See also

- CLI: `ark trace` — look up traces from the terminal
- Environment Variables reference — all OTel-related vars
- Cost Management guide — using `arkonis.llm.tokens.*` for spend tracking