# Observability
ark-operator emits OpenTelemetry traces and metrics from both the operator and every agent pod. Because it uses the standard OTLP exporter, telemetry routes to any compatible backend — Jaeger, Tempo, Prometheus, Datadog, Grafana Cloud, Honeycomb, New Relic — without changing any operator code.
## Metrics

### Agent runtime metrics

Emitted by each agent pod during task processing.
| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `arkonis.task.started` | Counter | — | namespace, agent, role | Tasks pulled from the queue. |
| `arkonis.task.completed` | Counter | — | namespace, agent, role | Tasks completed successfully. |
| `arkonis.task.failed` | Counter | — | namespace, agent, role | Tasks that errored or exhausted retries. |
| `arkonis.task.duration` | Histogram | ms | namespace, agent, role | End-to-end task wall time from poll to ack. |
| `arkonis.task.queue_wait` | Histogram | ms | namespace, agent, role | Time from task enqueue to agent poll (scheduling latency). |
| `arkonis.llm.call.duration` | Histogram | ms | namespace, agent, provider, model | Single LLM API round-trip (one iteration of the tool-use loop). |
| `arkonis.llm.tokens.input` | Counter | — | namespace, agent, provider, model | Input tokens consumed. Use this for cost attribution. |
| `arkonis.llm.tokens.output` | Counter | — | namespace, agent, provider, model | Output tokens produced. |
| `arkonis.tool.call.duration` | Histogram | ms | namespace, agent, tool_name | Single tool invocation time (MCP or webhook). |
| `arkonis.tool.call.errors` | Counter | — | namespace, agent, tool_name | Tool invocations that returned an error. |
| `arkonis.delegate.submitted` | Counter | — | namespace, agent, from_role, to_role | Tasks submitted via the `delegate()` built-in tool. |
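The token counters lend themselves to a quick back-of-the-envelope cost estimate. A minimal sketch, where the model name and per-token prices are illustrative placeholders (real rates vary by provider and model), fed with deltas read from `arkonis.llm.tokens.input` / `arkonis.llm.tokens.output`:

```python
# Rough cost attribution from the token counters.
# Model name and prices are illustrative placeholders, not real provider rates.
PRICE_PER_1K = {
    "gpt-4o": (0.005, 0.015),  # (input $/1k tokens, output $/1k tokens) — assumed
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar estimate from token-counter deltas for one model."""
    in_rate, out_rate = PRICE_PER_1K[model]
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate

print(round(estimate_cost("gpt-4o", 1204, 312), 4))  # → 0.0107
```

In practice you would run this over per-model counter deltas pulled from your metrics backend rather than hard-coded numbers.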
### Operator metrics

Emitted by the operator reconcile loops.
| Metric | Type | Unit | Labels | Description |
|---|---|---|---|---|
| `arkonis.reconcile.duration` | Histogram | ms | controller, namespace, result | Reconcile loop latency. `result` is `ok` or `error`. |
| `arkonis.reconcile.errors` | Counter | — | controller, namespace | Reconcile loops that returned an error. |
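The error counter is a natural alerting signal. A sketch of a Prometheus alerting rule on sustained reconcile errors — the threshold and durations are illustrative, and it assumes metrics are flowing into Prometheus as described under "Connecting to a backend":

```yaml
# Illustrative Prometheus alerting rule — tune the threshold for your cluster.
groups:
  - name: ark-operator
    rules:
      - alert: ArkReconcileErrors
        expr: rate(arkonis_reconcile_errors_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ark-operator reconcile loop erroring in {{ $labels.namespace }}"
```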
## Traces (per agent task)

Spans are linked via W3C TraceContext propagated through task queue metadata, so a single trace spans operator → queue → agent pod.
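The propagation itself is plain W3C TraceContext: a `traceparent` string carried alongside the task payload, which the agent pod parses to start its span as a child of the operator's. A minimal sketch of forming and parsing that header — the `task_metadata` field name is illustrative, not ark-operator's actual queue schema:

```python
import re

# W3C TraceContext "traceparent": version-traceid-spanid-flags
def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str) -> dict:
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        raise ValueError(f"malformed traceparent: {header}")
    return {"trace_id": m.group(1), "span_id": m.group(2), "sampled": m.group(3) == "01"}

# The producer attaches the header to queue metadata; the consumer parses it
# and continues the same trace on the other side of the queue.
task_metadata = {"traceparent": make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")}
ctx = parse_traceparent(task_metadata["traceparent"])
print(ctx["trace_id"])  # → 4bf92f3577b34da6a3ce929d0e0e4736
```

With OpenTelemetry SDKs this round-trip is normally handled by the W3C propagator rather than written by hand.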
| Span | Emitted by | Key attributes |
|---|---|---|
| `arkonis.task` | Agent runtime | `agent.name`, `task.id`, `task.prompt_len` |
| `arkonis.llm.call` | LLM provider | `llm.model`, `llm.provider`, `llm.input_tokens`, `llm.output_tokens` |
| `arkonis.tool.call` | Runner (per tool use) | `tool.name`, `tool.type` |
| `arkonis.delegate` | Runner | `from_role`, `to_role` |
| `arkonis.reconcile` | Controller | `controller`, `namespace`, `name` |
## Kubernetes Events (audit log)

Every agent action emits a structured Kubernetes Event visible with `kubectl`:
| Reason | Emitted on | Description |
|---|---|---|
| `TaskStarted` | ArkAgent | Agent pod picks up a task from the queue. |
| `TaskCompleted` | ArkAgent | Task finishes successfully. |
| `TaskFailed` | ArkAgent | Task fails or exceeds timeout. |
| `TaskDelegated` | ArkAgent | Agent delegates a sub-task to a team role. |
```shell
# All ArkAgent events in the namespace
kubectl get events -n ai-team --field-selector involvedObject.kind=ArkAgent

# Only delegation events
kubectl get events -n ai-team --field-selector reason=TaskDelegated
```
## Connecting to a backend

Enable telemetry by setting `OTEL_EXPORTER_OTLP_ENDPOINT` on the operator and agent pods. All other configuration uses standard OTel SDK environment variables — no ark-operator-specific settings.

### Jaeger (all-in-one for development)
```yaml
# Deploy Jaeger all-in-one
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
          ports:
            - containerPort: 16686 # UI
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: monitoring
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318
```
Point the operator at Jaeger via Helm:

```shell
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://jaeger.monitoring.svc.cluster.local:4317
```

Open the UI:

```shell
kubectl port-forward svc/jaeger 16686:16686 -n monitoring
open http://localhost:16686
```

Select service `ark-agent`, click Find Traces, then click any trace to see the full waterfall.
### Prometheus (via OpenTelemetry Collector)

Metrics are exported via OTLP. Prometheus scrapes them through an OpenTelemetry Collector configured with a `prometheus` exporter.

1. Deploy the OTel Collector:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [prometheus]
        # A pipeline must declare at least one exporter, so forward traces
        # only once you have a trace backend to send them to, e.g.:
        # traces:
        #   receivers: [otlp]
        #   exporters: [otlp]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel/config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
    - name: metrics
      port: 8889
```
2. Configure ark-operator to send to the collector:

```shell
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=http://otel-collector.monitoring.svc.cluster.local:4317
```
3. Add a Prometheus scrape job:

```yaml
# prometheus.yml scrape config
scrape_configs:
  - job_name: ark-operator
    static_configs:
      - targets: ['otel-collector.monitoring.svc.cluster.local:8889']
```
Example PromQL queries:

```promql
# Task throughput (tasks per second, averaged over 1m)
rate(arkonis_task_completed_total[1m])

# Failed task rate
rate(arkonis_task_failed_total[1m])

# P99 task duration by agent
histogram_quantile(0.99, sum by (agent, le) (rate(arkonis_task_duration_milliseconds_bucket[5m])))

# Input tokens by model over the last hour (for cost attribution)
sum by (model) (increase(arkonis_llm_tokens_input_total[1h]))

# Operator reconcile error rate
rate(arkonis_reconcile_errors_total[5m])
```
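The P99 query relies on `histogram_quantile`, which linearly interpolates within cumulative `le` buckets. A minimal Python sketch of that estimate, handy for sanity-checking dashboard numbers — the bucket bounds and counts below are made up:

```python
def histogram_quantile(q, buckets):
    """Approximate Prometheus histogram_quantile over cumulative (le, count) buckets."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket, as Prometheus does.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# Cumulative task-duration buckets in ms: 40 obs ≤ 500ms, 90 ≤ 1000ms, 100 ≤ 5000ms
print(histogram_quantile(0.99, [(500, 40), (1000, 90), (5000, 100)]))  # → 4600.0
```

The estimate lands inside the last bucket because only 10 of 100 observations exceed 1000 ms; coarse buckets at the tail make high quantiles correspondingly coarse.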
### Grafana Cloud / Datadog / other OTLP backends

Set the endpoint and any required authentication headers via standard OTel env vars:

```shell
# Grafana Cloud
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=https://otlp-gateway-prod-us-east-0.grafana.net/otlp \
  --set otel.headers="Authorization=Basic <base64-encoded-credentials>"

# Datadog
helm upgrade ark-operator arkonis/ark-operator \
  --set otel.endpoint=https://otlp.datadoghq.com \
  --set otel.headers="DD-API-KEY=<your-api-key>"
```

These map to the standard `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` env vars injected into both the operator and agent pods.
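Concretely, those Helm values should surface in the rendered pod spec roughly as follows — an illustrative rendering, not the chart's exact output:

```yaml
# What the otel.* Helm values translate to on the operator and agent pods
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: https://otlp.datadoghq.com
  - name: OTEL_EXPORTER_OTLP_HEADERS
    value: DD-API-KEY=<your-api-key>
```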
## `ark run --trace` (local, no backend)

When running flows locally, `--trace` collects spans in-memory and prints a tree after the flow completes — no backend required:

```shell
ark run quickstart.yaml --trace
```

```
Flow succeeded in 9.1s — 1,516 tokens

arkonis.task [9.1s]
├─ arkonis.llm.call [4.2s] research in=1,204 out=312
└─ arkonis.llm.call [2.1s] summarize in=312 out=88
```
## `ark trace` (remote lookup)

Look up a specific task by ID in a running Jaeger or Tempo instance:

```shell
# Jaeger
ark trace <task-id> --endpoint http://localhost:16686

# Tempo
ark trace <task-id> --endpoint http://tempo.monitoring.svc.cluster.local:3100
```
Find task IDs from an ArkRun:

```shell
kubectl get arkrun <run-name> -n ai-team \
  -o jsonpath='{.status.steps[*].taskID}'
```
## Environment variables

All variables follow the OpenTelemetry specification and apply to both the operator pod and all agent pods.
| Variable | Description |
|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTel collector endpoint. Enables tracing and metrics when set. Example: `http://jaeger:4317` |
| `OTEL_EXPORTER_OTLP_HEADERS` | Comma-separated `key=value` headers for the OTLP exporter. Use for auth tokens. |
| `OTEL_SERVICE_NAME` | Service name in traces. Defaults to `ark-agent` for agent pods, `ark-operator` for the operator. |
| `OTEL_TRACES_SAMPLER` | Sampling strategy. Default: `parentbased_always_on`. Set to `traceidratio` with `OTEL_TRACES_SAMPLER_ARG=0.1` for 10% sampling in high-volume clusters. |
| `OTEL_EXPORTER_OTLP_PROTOCOL` | Transport protocol: `grpc` (default) or `http/protobuf`. |
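For example, 10% head sampling in a high-volume cluster is just the two standard sampler variables set on the pods — a sketch of the env block, assuming you manage pod env directly rather than through Helm values:

```yaml
# 10% trace sampling — standard OTel SDK env vars, no ark-operator-specific keys
env:
  - name: OTEL_TRACES_SAMPLER
    value: traceidratio
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"
```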
## See also

- CLI: `ark trace` — look up traces from the terminal
- Environment Variables reference — all OTel-related vars
- Cost Management guide — using `arkonis.llm.tokens.*` for spend tracking