# Observability

OpenTelemetry traces, metrics, and Kubernetes audit events from ark-operator and agent pods — connect to any OTel-compatible backend.

ark-operator and its agent pods emit OpenTelemetry traces and metrics. Because both use the standard OTLP exporter, telemetry routes to any compatible backend (Jaeger, Tempo, Prometheus, Datadog, Grafana Cloud, Honeycomb, New Relic) without changing any operator code.


## Metrics

### Agent runtime metrics

Emitted by each agent pod during task processing.

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `arkonis.task.started` | Counter | | `namespace`, `agent`, `role` | Tasks pulled from the queue. |
| `arkonis.task.completed` | Counter | | `namespace`, `agent`, `role` | Tasks completed successfully. |
| `arkonis.task.failed` | Counter | | `namespace`, `agent`, `role` | Tasks that errored or exhausted retries. |
| `arkonis.task.duration` | Histogram | ms | `namespace`, `agent`, `role` | End-to-end task wall time from poll to ack. |
| `arkonis.task.queue_wait` | Histogram | ms | `namespace`, `agent`, `role` | Time from task enqueue to agent poll (scheduling latency). |
| `arkonis.llm.call.duration` | Histogram | ms | `namespace`, `agent`, `provider`, `model` | Single LLM API round-trip (one iteration of the tool-use loop). |
| `arkonis.llm.tokens.input` | Counter | | `namespace`, `agent`, `provider`, `model` | Input tokens consumed. Use this for cost attribution. |
| `arkonis.llm.tokens.output` | Counter | | `namespace`, `agent`, `provider`, `model` | Output tokens produced. |
| `arkonis.tool.call.duration` | Histogram | ms | `namespace`, `agent`, `tool_name` | Single tool invocation time (MCP or webhook). |
| `arkonis.tool.call.errors` | Counter | | `namespace`, `agent`, `tool_name` | Tool invocations that returned an error. |
| `arkonis.delegate.submitted` | Counter | | `namespace`, `agent`, `from_role`, `to_role` | Tasks submitted via the `delegate()` built-in tool. |

### Operator metrics

Emitted by the operator reconcile loops.

| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `arkonis.reconcile.duration` | Histogram | ms | `controller`, `namespace`, `result` | Reconcile loop latency. `result` is `ok` or `error`. |
| `arkonis.reconcile.errors` | Counter | | `controller`, `namespace` | Reconcile loops that returned an error. |

## Traces (per agent task)

Spans are linked via W3C TraceContext propagated through task queue metadata, so a single trace spans operator → queue → agent pod.

| Span | Emitted by | Key attributes |
|---|---|---|
| `arkonis.task` | Agent runtime | `agent.name`, `task.id`, `task.prompt_len` |
| `arkonis.llm.call` | LLM provider | `llm.model`, `llm.provider`, `llm.input_tokens`, `llm.output_tokens` |
| `arkonis.tool.call` | Runner (per tool use) | `tool.name`, `tool.type` |
| `arkonis.delegate` | Runner | `from_role`, `to_role` |
| `arkonis.reconcile` | Controller | `controller`, `namespace`, `name` |
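Concretely, the propagated context is a standard W3C `traceparent` header (`version-traceid-parentid-flags`) carried alongside the task payload. The queue schema is internal; the field layout below is only a sketch of where the header rides along:

```yaml
# Hypothetical task-queue message — field names are illustrative.
task:
  id: task-7f3a
  prompt: "Summarize the design doc"
metadata:
  # W3C TraceContext header linking this task to the parent trace.
  traceparent: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```

The agent pod extracts `traceparent` when it polls the task, so its `arkonis.task` span becomes a child of the operator-side span rather than starting a new trace.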

## Kubernetes Events (audit log)

Every agent action emits a structured Kubernetes Event visible with kubectl:

| Reason | Emitted on | Description |
|---|---|---|
| `TaskStarted` | ArkAgent | Agent pod picks up a task from the queue. |
| `TaskCompleted` | ArkAgent | Task finishes successfully. |
| `TaskFailed` | ArkAgent | Task fails or exceeds timeout. |
| `TaskDelegated` | ArkAgent | Agent delegates a sub-task to a team role. |

```shell
# All agent events in the namespace
kubectl get events -n ai-team --field-selector involvedObject.kind=ArkAgent

# Delegation events only
kubectl get events -n ai-team --field-selector reason=TaskDelegated
```
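A delegation event has the shape of any core/v1 Event. The names and message text below are illustrative, not taken from a real cluster:

```yaml
# Sketch of a TaskDelegated event — object names and message are hypothetical.
apiVersion: v1
kind: Event
metadata:
  name: researcher.17a2b3c4d5e6f7a8
  namespace: ai-team
type: Normal
reason: TaskDelegated
message: 'Delegated sub-task to role "writer"'
involvedObject:
  kind: ArkAgent
  name: researcher
  namespace: ai-team
```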

## Connecting to a backend

Enable telemetry by setting `OTEL_EXPORTER_OTLP_ENDPOINT` on the operator and agent pods. All other configuration uses standard OTel SDK environment variables — no ark-operator-specific settings.

### Jaeger (all-in-one for development)

```yaml
# Deploy Jaeger all-in-one
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
          ports:
            - containerPort: 16686   # UI
            - containerPort: 4317    # OTLP gRPC
            - containerPort: 4318    # OTLP HTTP
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: monitoring
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318
```

Point the operator at Jaeger via Helm:

```shell
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://jaeger.monitoring.svc.cluster.local:4317
```

Open the UI:

```shell
kubectl port-forward svc/jaeger 16686:16686 -n monitoring
open http://localhost:16686
```

Select the `ark-agent` service, click **Find Traces**, then open any trace to see the full waterfall.

### Prometheus (via OpenTelemetry Collector)

Metrics are exported via OTLP. Prometheus scrapes them through an OpenTelemetry Collector configured with a prometheus exporter.

1. Deploy the OTel Collector:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    exporters:
      prometheus:
        endpoint: 0.0.0.0:8889
      # Stand-in so the traces pipeline validates (every pipeline needs
      # at least one exporter); swap in a real trace exporter, e.g. otlp
      # to Tempo, as needed.
      debug: {}

    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [prometheus]
        traces:
          receivers: [otlp]
          exporters: [debug]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel/config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
    - name: metrics
      port: 8889
```

2. Configure ark-operator to send to the collector:

```shell
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://otel-collector.monitoring.svc.cluster.local:4317
```

3. Add a Prometheus scrape job:

```yaml
# prometheus.yml scrape config
scrape_configs:
  - job_name: ark-operator
    static_configs:
      - targets: ['otel-collector.monitoring.svc.cluster.local:8889']
```

Example PromQL queries:

```promql
# Task throughput (per-second rate, averaged over 1m)
rate(arkonis_task_completed_total[1m])

# Failed task rate
rate(arkonis_task_failed_total[1m])

# P99 task duration by agent
histogram_quantile(0.99, sum by (agent, le) (rate(arkonis_task_duration_milliseconds_bucket[5m])))

# Input tokens by model over the last hour (for cost attribution)
sum by (model) (increase(arkonis_llm_tokens_input_total[1h]))

# Operator reconcile error rate
rate(arkonis_reconcile_errors_total[5m])
```
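These queries translate directly into Prometheus alerting rules. A sketch — the group name, alert names, and thresholds are illustrative, not shipped with ark-operator:

```yaml
# Illustrative Prometheus alerting rules built on the metrics above.
groups:
  - name: arkonis-alerts
    rules:
      - alert: ArkTaskFailureRateHigh
        # Fire when more than 5% of started tasks fail over 10 minutes.
        expr: |
          sum(rate(arkonis_task_failed_total[10m]))
            / sum(rate(arkonis_task_started_total[10m])) > 0.05
        for: 10m
        labels:
          severity: warning
      - alert: ArkReconcileErrors
        # Any sustained reconcile errors usually indicate a bad resource spec.
        expr: rate(arkonis_reconcile_errors_total[5m]) > 0
        for: 15m
        labels:
          severity: warning
```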

### Grafana Cloud / Datadog / other OTLP backends

Set the endpoint and any required authentication headers via standard OTel env vars:

```shell
# Grafana Cloud
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=https://otlp-gateway-prod-us-east-0.grafana.net/otlp \
  --set otel.headers="Authorization=Basic <base64-encoded-credentials>"

# Datadog
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=https://otlp.datadoghq.com \
  --set otel.headers="DD-API-KEY=<your-api-key>"
```

These map to the standard `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` env vars injected into both the operator and agent pods.
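For reference, the injected environment on a pod ends up looking roughly like this (a sketch of the Grafana Cloud case; the exact deployment spec is managed by the operator):

```yaml
# Illustrative only: what the Helm settings above translate into.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "https://otlp-gateway-prod-us-east-0.grafana.net/otlp"
  - name: OTEL_EXPORTER_OTLP_HEADERS
    value: "Authorization=Basic <base64-encoded-credentials>"
```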


## `ark run --trace` (local, no backend)

When running flows locally, --trace collects spans in-memory and prints a tree after the flow completes — no backend required:

```
$ ark run quickstart.yaml --trace
Flow succeeded in 9.1s — 1,516 tokens

arkonis.task [9.1s]
  ├─ arkonis.llm.call [4.2s]  research  in=1,204 out=312
  └─ arkonis.llm.call [2.1s]  summarize  in=312 out=88
```

## `ark trace` (remote lookup)

Look up a specific task by ID in a running Jaeger or Tempo instance:

```shell
# Jaeger
ark trace <task-id> --endpoint http://localhost:16686

# Tempo
ark trace <task-id> --endpoint http://tempo.monitoring.svc.cluster.local:3100
```

Find task IDs from an ArkRun:

```shell
kubectl get arkrun <run-name> -n ai-team \
  -o jsonpath='{.status.steps[*].taskID}'
```

## Environment variables

All variables follow the OpenTelemetry specification and apply to both the operator pod and all agent pods.

| Variable | Description |
|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTel collector endpoint. Enables tracing and metrics when set. Example: `http://jaeger:4317` |
| `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated `key=value` headers for the OTLP exporter. Use for auth tokens. |
| `OTEL_SERVICE_NAME` | Service name in traces. Defaults to `ark-agent` for agent pods, `ark-operator` for the operator. |
| `OTEL_TRACES_SAMPLER` | Sampling strategy. Default: `parentbased_always_on`. Set to `traceidratio` with `OTEL_TRACES_SAMPLER_ARG=0.1` for 10% sampling in high-volume clusters. |
| `OTEL_EXPORTER_OTLP_PROTOCOL` | Transport protocol: `grpc` (default) or `http/protobuf`. |
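For example, enabling 10% trace sampling means setting two of these variables on the pods. A sketch of the resulting container env (how extra env vars are injected depends on your Helm values):

```yaml
# Illustrative pod env for 10% trace sampling.
env:
  - name: OTEL_TRACES_SAMPLER
    value: "traceidratio"
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"
```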

## See also