# Monitoring

Arbiter exposes Prometheus metrics and structured JSONL audit logs. This guide covers setting up production monitoring: scraping metrics, building dashboards, and configuring alerts.

## Prometheus Scraping

Add Arbiter to your Prometheus configuration:

```yaml
scrape_configs:
  - job_name: 'arbiter'
    static_configs:
      - targets: ['arbiter:8080']
    metrics_path: /metrics
    scrape_interval: 15s
```

If you're running in Kubernetes, the equivalent ServiceMonitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arbiter
spec:
  selector:
    matchLabels:
      app: arbiter
  endpoints:
    - port: proxy
      path: /metrics
      interval: 15s
```
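Before wiring up Prometheus, it's worth confirming the endpoint responds. A quick sanity check; the metric names match those used in the queries below, but the label values and counts shown are illustrative:

```bash
# Confirm the metrics endpoint is serving Prometheus text format.
$ curl -s http://localhost:8080/metrics | grep '^requests_total'
requests_total{decision="allow"} 1042
requests_total{decision="deny"} 17
```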
## Dashboard Layout

A useful Arbiter dashboard has four panels:

### Request Volume and Decisions

```promql
# Stacked area chart: allow vs deny vs escalate
sum by (decision) (rate(requests_total[5m]))
```

This tells you at a glance whether the system is mostly allowing or mostly denying. A sudden spike in denials warrants investigation.

### Top Tools

```promql
topk(10, sum by (tool) (rate(tool_calls_total[5m])))
```

Shows which tools agents are actually using. Useful for capacity planning and for spotting unexpected tool usage.

### Request Latency

```promql
histogram_quantile(0.50, rate(request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))
```

P50, P95, P99 in a single chart. Arbiter's overhead is typically under 5ms; if latency is high, the upstream MCP server is the bottleneck.

### Active Resources

```promql
active_sessions
registered_agents
```

Gauges that show current utilization.

## Alerts

### High Denial Rate

```yaml
- alert: ArbiterHighDenialRate
  expr: |
    sum(rate(requests_total{decision="deny"}[5m]))
      / sum(rate(requests_total[5m])) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "More than 50% of requests are being denied"
```

Note the `sum()` on both sides: without it, the `deny` series on the left would match only the `deny` series on the right and divide by itself, so the ratio would always be 1.

### Anomaly Detection Firing

```yaml
- alert: ArbiterAnomalySpike
  expr: rate(anomalies_total[5m]) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Behavioral anomalies detected. Agents may be drifting from intent."
```

### Upstream Latency Degradation

```yaml
- alert: ArbiterUpstreamSlow
  expr: histogram_quantile(0.99, rate(upstream_duration_seconds_bucket[5m])) > 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Upstream MCP server P99 latency exceeds 5 seconds"
```
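These snippets are individual rules; Prometheus loads them from a rules file referenced in `prometheus.yml`. A minimal wrapper (the file name is arbitrary):

```yaml
# arbiter-alerts.yml: wrap the three alert definitions above in a rule group.
groups:
  - name: arbiter
    rules:
      # paste the alert definitions from above here
```

Then reference it with `rule_files: ['arbiter-alerts.yml']` in `prometheus.yml`.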
## Health Checks

```bash
$ curl http://localhost:8080/health
OK
```

Returns HTTP 200 with body `OK`. Use this for:

- Load balancer health checks
- Kubernetes liveness/readiness probes (see the sketch after this list)
- Uptime monitoring
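A minimal probe sketch for a Kubernetes container spec, assuming the container serves the proxy on port 8080 as in the scrape config above; adjust the port and timings to your deployment:

```yaml
# Probes against Arbiter's /health endpoint. The delays and periods are
# starting points, not recommendations from the Arbiter docs.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
```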
## Structured Logs

Arbiter uses `tracing-subscriber` for structured logging. Log level is configurable at startup:

```bash
$ arbiter --config arbiter.toml --log-level info
```

Levels: `error`, `warn`, `info`, `debug`, `trace`. In production, `info` is a good default: it logs request summaries without the noise of debug-level middleware tracing.
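Because the audit log is JSONL, each line is a standalone JSON object you can query with standard tools. A sketch with `jq`; the field names (`decision`, `tool`) and the file path are assumptions for illustration, so check {doc}`../guides/audit` for the actual schema:

```bash
# Count denied calls per tool from a JSONL audit log.
# "decision" and "tool" are assumed field names; verify against the audit schema.
$ jq -r 'select(.decision == "deny") | .tool' audit.jsonl | sort | uniq -c | sort -rn
```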
## Next Steps

- {doc}`troubleshooting`: diagnosing common issues
- {doc}`../guides/metrics`: detailed metrics reference
- {doc}`../guides/audit`: structured audit logging