# Monitoring
Arbiter exposes Prometheus metrics and structured JSONL audit logs. This guide covers setting up production monitoring: scraping metrics, building dashboards, and configuring alerts.
## Prometheus Scraping

Add Arbiter to your Prometheus configuration:

```yaml
scrape_configs:
  - job_name: 'arbiter'
    static_configs:
      - targets: ['arbiter:8080']
    metrics_path: /metrics
    scrape_interval: 15s
```
If running in Kubernetes, the equivalent ServiceMonitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arbiter
spec:
  selector:
    matchLabels:
      app: arbiter
  endpoints:
    - port: proxy
      path: /metrics
      interval: 15s
```
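A ServiceMonitor selects Services, so a Service with the matching label and a named `proxy` port must exist. A minimal sketch, assuming Arbiter's proxy listens on 8080 as in the static config above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: arbiter
  labels:
    app: arbiter        # matched by the ServiceMonitor's selector
spec:
  selector:
    app: arbiter
  ports:
    - name: proxy       # must match the ServiceMonitor's `port: proxy`
      port: 8080
      targetPort: 8080
```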
## Dashboard Layout

A useful Arbiter dashboard has four panels:

### Request Volume and Decisions

```promql
# Stacked area chart: allow vs deny vs escalate
sum by (decision) (rate(requests_total[5m]))
```
This tells you at a glance whether the system is mostly allowing or mostly denying. A sudden spike in denials warrants investigation.
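When a denial spike does appear, breaking the rate down by another label helps attribute it. A sketch, assuming `requests_total` also carries an `agent` label (check the metrics reference for the labels your build actually exposes):

```promql
# Hypothetical drill-down: which agents are driving denials
# (assumes an `agent` label on requests_total)
topk(5, sum by (agent) (rate(requests_total{decision="deny"}[5m])))
```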
### Top Tools

```promql
topk(10, sum by (tool) (rate(tool_calls_total[5m])))
```
Shows which tools agents are actually using. Useful for capacity planning and for spotting unexpected tool usage.
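To make "unexpected tool usage" actionable rather than something you eyeball, you can alert on traffic to any tool outside an allowlist. A sketch; the tool names in the regex are placeholders, not Arbiter defaults:

```yaml
- alert: ArbiterUnexpectedTool
  # Placeholder allowlist -- replace with the tools your agents should use
  expr: sum by (tool) (rate(tool_calls_total{tool!~"search|fetch|summarize"}[5m])) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Tool {{ $labels.tool }} is receiving traffic but is not on the allowlist"
```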
### Request Latency

```promql
histogram_quantile(0.50, rate(request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))
```
P50, P95, P99 in a single chart. Arbiter’s overhead is typically under 5ms; if latency is high, the upstream MCP server is the bottleneck.
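To confirm which side is slow, subtract the upstream histogram (the same `upstream_duration_seconds` used in the alerts below) from the end-to-end one at the same quantile. Quantiles don't subtract exactly, so treat this as a rough proxy for Arbiter's own overhead:

```promql
# Approximate Arbiter overhead at P95 (rough: quantiles are not strictly subtractive)
histogram_quantile(0.95, rate(request_duration_seconds_bucket[5m]))
  - histogram_quantile(0.95, rate(upstream_duration_seconds_bucket[5m]))
```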
### Active Resources

```promql
active_sessions
registered_agents
```
Gauges that show current utilization.
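If you want a single derived number on the same panel, the ratio of the two gauges gives a rough load signal; this assumes every registered agent can hold sessions:

```promql
# Average concurrent sessions per registered agent
active_sessions / registered_agents
```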
## Alerts

### High Denial Rate

```yaml
- alert: ArbiterHighDenialRate
  expr: |
    sum(rate(requests_total{decision="deny"}[5m]))
      /
    sum(rate(requests_total[5m]))
    > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "More than 50% of requests are being denied"
```

The `sum()` on both sides matters: without it, one-to-one vector matching pairs each `deny` series with itself and the ratio is identically 1.
### Anomaly Detection Firing

```yaml
- alert: ArbiterAnomalySpike
  expr: rate(anomalies_total[5m]) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Behavioral anomalies detected. Agents may be drifting from intent."
```
### Upstream Latency Degradation

```yaml
- alert: ArbiterUpstreamSlow
  expr: histogram_quantile(0.99, rate(upstream_duration_seconds_bucket[5m])) > 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Upstream MCP server P99 latency exceeds 5 seconds"
```
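The snippets above are bare rule entries. Prometheus loads them from a rule file listed under `rule_files`, wrapped in a group:

```yaml
# arbiter-rules.yml -- reference it from prometheus.yml:
#   rule_files:
#     - arbiter-rules.yml
groups:
  - name: arbiter
    rules:
      - alert: ArbiterHighDenialRate
        # ... (paste the rule bodies from above here)
```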
## Health Checks

```console
$ curl http://localhost:8080/health
OK
```

Returns HTTP 200 with body `OK`. Use this for:

- Load balancer health checks
- Kubernetes liveness/readiness probes (see the sketch below)
- Uptime monitoring
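For the Kubernetes probes, a minimal sketch, assuming the container serves `/health` on the same 8080 port used for metrics:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
```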
## Structured Logs

Arbiter uses `tracing-subscriber` for structured logging. Log level is configurable at startup:

```console
$ arbiter --config arbiter.toml --log-level info
```

Levels: `error`, `warn`, `info`, `debug`, `trace`. In production, `info` is a good default: it logs request summaries without the noise of debug-level middleware tracing.
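The `--log-level` flag is the documented interface. If the binary also installs `tracing-subscriber`'s standard `EnvFilter` (an assumption, not something this guide confirms), per-module verbosity could be raised without turning on `debug` globally:

```console
# Assumes EnvFilter support; arbiter::proxy is a hypothetical module path
$ RUST_LOG=info,arbiter::proxy=debug arbiter --config arbiter.toml
```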
## Next Steps

- Troubleshooting: diagnosing common issues
- Monitoring & Metrics: detailed metrics reference
- Audit & Compliance: structured audit logging