Monitoring

Arbiter exposes Prometheus metrics and structured JSONL audit logs. This guide covers setting up production monitoring: scraping metrics, building dashboards, and configuring alerts.

Prometheus Scraping

Add Arbiter to your Prometheus configuration:

scrape_configs:
  - job_name: 'arbiter'
    static_configs:
      - targets: ['arbiter:8080']
    metrics_path: /metrics
    scrape_interval: 15s
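
Before pointing Prometheus at it, you can verify that the endpoint responds (assuming the proxy listens locally on 8080 as configured above):

$ curl -s http://localhost:8080/metrics | grep requests_total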

If you run the Prometheus Operator in Kubernetes, the equivalent ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arbiter
spec:
  selector:
    matchLabels:
      app: arbiter
  endpoints:
    - port: proxy
      path: /metrics
      interval: 15s
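
The ServiceMonitor selects on the app: arbiter label and scrapes the named port proxy, so the backing Service must expose both. A minimal sketch, assuming the proxy listens on 8080 as in the scrape config above:

apiVersion: v1
kind: Service
metadata:
  name: arbiter
  labels:
    app: arbiter
spec:
  selector:
    app: arbiter
  ports:
    - name: proxy   # must match the ServiceMonitor endpoint port
      port: 8080
      targetPort: 8080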

Dashboard Layout

A useful Arbiter dashboard has four panels:

Request Volume and Decisions

# Stacked area chart: allow vs deny vs escalate
sum by (decision) (rate(requests_total[5m]))

This tells you at a glance whether the system is mostly allowing or mostly denying. A sudden spike in denials warrants investigation.
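
A single-stat companion panel showing the deny fraction directly makes such a spike easier to spot; this is the same ratio the High Denial Rate alert below uses:

sum(rate(requests_total{decision="deny"}[5m]))
/
sum(rate(requests_total[5m]))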

Top Tools

topk(10, sum by (tool) (rate(tool_calls_total[5m])))

Shows which tools agents are actually using. Useful for capacity planning and for spotting unexpected tool usage.
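
For capacity planning, a longer window smooths out bursts. Assuming your Prometheus retention covers it, the absolute call count over a day is a useful variant:

topk(10, sum by (tool) (increase(tool_calls_total[24h])))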

Request Latency

histogram_quantile(0.50, sum by (le) (rate(request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(request_duration_seconds_bucket[5m])))

P50, P95, and P99 in a single chart. Arbiter’s own overhead is typically under 5ms; if overall latency is high, the upstream MCP server is usually the bottleneck.
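
To separate proxy overhead from upstream time, compare the two histograms. Subtracting quantiles is only an approximation (quantiles do not compose across distributions), but it is usually close enough to show where the time goes:

# Approximate P95 proxy overhead: total request time minus upstream time
histogram_quantile(0.95, sum by (le) (rate(request_duration_seconds_bucket[5m])))
-
histogram_quantile(0.95, sum by (le) (rate(upstream_duration_seconds_bucket[5m])))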

Active Resources

active_sessions
registered_agents

Gauges that show current utilization.

Alerts

High Denial Rate

- alert: ArbiterHighDenialRate
  expr: |
    sum(rate(requests_total{decision="deny"}[5m]))
    /
    sum(rate(requests_total[5m]))
    > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "More than 50% of requests are being denied"

Anomaly Detection Firing

- alert: ArbiterAnomalySpike
  expr: rate(anomalies_total[5m]) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Behavioral anomalies detected. Agents may be drifting from intent."

Upstream Latency Degradation

- alert: ArbiterUpstreamSlow
  expr: histogram_quantile(0.99, sum by (le) (rate(upstream_duration_seconds_bucket[5m]))) > 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Upstream MCP server P99 latency exceeds 5 seconds"

Health Checks

$ curl http://localhost:8080/health
OK

Returns HTTP 200 with body OK. Use this for:

  • Load balancer health checks

  • Kubernetes liveness/readiness probes (see the probe sketch below)

  • Uptime monitoring
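
For the Kubernetes case, a minimal probe sketch, assuming the proxy serves /health on port 8080 as above:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5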

Structured Logs

Arbiter uses tracing-subscriber for structured logging. Log level is configurable at startup:

$ arbiter --config arbiter.toml --log-level info

Levels: error, warn, info, debug, trace. In production, info is a good default: it logs request summaries without the noise of debug-level middleware tracing.
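
The JSONL audit log pairs well with line-oriented tooling. For example, to pull denied requests out with jq (both the file path and the decision field name are assumptions; check your audit configuration and schema):

$ jq -c 'select(.decision == "deny")' audit.jsonl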

Next Steps