Monitoring & Metrics
Arbiter exposes Prometheus-compatible metrics on the proxy’s /metrics endpoint. These cover request volume, authorization decisions, tool usage, latency, and resource utilization.
Accessing Metrics
$ curl http://localhost:8080/metrics
The response is in Prometheus text exposition format, ready to scrape.
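If you want to inspect the endpoint's output without a Prometheus server, the text format is simple enough to read by hand. The following is a minimal sketch of a parser, not a full implementation of the exposition format: it assumes no timestamps after the value and no commas or spaces inside label values. The sample payload is illustrative, and the `allow` label value is an assumption (only `deny` appears in this page's queries).

```python
def parse_metrics(text):
    """Parse exposition-format lines into {(name, labels): value}.

    Simplified: ignores timestamps and escaped characters in label values.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        name, _, rest = series.partition("{")
        labels = ()
        if rest:
            pairs = rest.rstrip("}").split(",")
            labels = tuple(sorted(
                (key.strip(), val.strip().strip('"'))
                for key, val in (pair.split("=", 1) for pair in pairs if pair)
            ))
        samples[(name, labels)] = float(value)
    return samples


# Illustrative payload; real output from /metrics will contain more series.
SAMPLE = """\
# TYPE requests_total counter
requests_total{decision="allow"} 90
requests_total{decision="deny"} 10
"""
```

Calling `parse_metrics(SAMPLE)` yields a dictionary keyed by metric name and sorted label pairs, which is convenient for ad hoc checks in scripts or tests.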
Available Metrics
Counters
| Metric | Labels | Description |
|---|---|---|
| requests_total | decision | Total proxied requests by authorization outcome |
| tool_calls_total | | Total calls per tool name |
| | (none) | Total behavioral anomalies detected |
Histograms
| Metric | Buckets | Description |
|---|---|---|
| request_duration_seconds | 5ms to 10s | End-to-end request duration including upstream |
| upstream_duration_seconds | 5ms to 10s | Time spent waiting for the upstream MCP server |
Gauges
| Metric | Description |
|---|---|
| | Currently active task sessions |
| | Total registered agents |
Useful Queries
Denial Rate
sum(rate(requests_total{decision="deny"}[5m])) / sum(rate(requests_total[5m]))
A high denial rate might mean policies are too restrictive, or it might mean an agent is misbehaving. Check the audit log to distinguish.
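To make the ratio concrete, here is a minimal sketch of the same calculation done by hand from two scrapes of `requests_total`. It assumes each snapshot has already been reduced to a mapping from `decision` label value to counter value, and it ignores counter resets, which Prometheus's `rate()` handles for you. The `allow` label value is an assumption.

```python
def denial_rate(before, after):
    """Fraction of requests denied between two counter snapshots.

    Each snapshot maps a decision label value to its requests_total count.
    """
    # Per-decision increase over the window (no counter-reset handling).
    deltas = {d: after[d] - before.get(d, 0.0) for d in after}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0  # no traffic in the window
    return deltas.get("deny", 0.0) / total
```

For example, if the counters move from 100 allowed and 10 denied to 190 allowed and 20 denied, the window saw 100 requests of which 10 were denied, so the function returns 0.1.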
Most-Called Tools
topk(10, rate(tool_calls_total[1h]))
P99 Latency
histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))
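The quantile returned by this query is an estimate derived from bucket counts, not an exact measurement. The sketch below mirrors, in simplified form, how a quantile is estimated from cumulative buckets: find the bucket containing the target rank, then interpolate linearly inside it. The intermediate bucket bounds and all counts in the usage note are hypothetical; only the 5ms and 10s endpoints come from the table above.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan
```

With hypothetical buckets `[(0.005, 0), (0.1, 50), (1.0, 90), (10.0, 100), (math.inf, 100)]`, the 99th-percentile rank lands in the 1s to 10s bucket, so the estimate depends heavily on that bucket's width. Coarse upper buckets make P99 values correspondingly coarse.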
Upstream vs. Arbiter Overhead
histogram_quantile(0.50, rate(request_duration_seconds_bucket[5m]))
-
histogram_quantile(0.50, rate(upstream_duration_seconds_bucket[5m]))
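The difference approximates Arbiter’s middleware overhead, which is typically under 5ms for the full chain. Because it subtracts two medians, treat it as an estimate rather than an exact per-request figure.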
Health Check
$ curl http://localhost:8080/health
OK
Returns 200 with body OK when the proxy is running and can reach the upstream. Use this for load balancer health checks and readiness probes.
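For environments where a probe has to be scripted rather than configured, the check above can be sketched in Python using only the standard library. This is an illustration, not part of Arbiter: the function name, the 2-second timeout, and the injectable `opener` parameter (there to make the function testable without a running proxy) are all assumptions.

```python
import urllib.error
import urllib.request


def is_healthy(url="http://localhost:8080/health", timeout=2.0,
               opener=urllib.request.urlopen):
    """Return True when the endpoint answers 200 with body OK."""
    try:
        with opener(url, timeout=timeout) as resp:
            body = resp.read().decode().strip()
            return resp.status == 200 and body == "OK"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False
```

A wrapper script can call `is_healthy()` and exit non-zero on `False`, which is the convention most probe runners expect.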
Configuration
[metrics]
enabled = true
Set enabled = false to disable the /metrics endpoint if you’re not using Prometheus.
Alerting Suggestions
Based on the available metrics, here’s a sensible starting set of alerts:
| Alert | Condition | Why |
|---|---|---|
| High denial rate | | More than half of requests being denied suggests misconfiguration or attack |
| Anomaly spike | | Sustained anomalies mean agents are drifting from declared intent |
| High latency | | P99 above 2 seconds suggests upstream issues |
| Session exhaustion | | Agents may be hitting session limits |
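Two of these alerts carry explicit thresholds in their rationale: more than half of requests denied, and P99 above 2 seconds. As a rough sketch, the function below evaluates just those two against values you have already computed, for example from the denial rate and P99 queries earlier on this page. The function name and return shape are hypothetical; in production these checks would normally live in Prometheus alerting rules rather than application code.

```python
def firing_alerts(denial_rate, p99_seconds):
    """Return the names of threshold alerts that would currently fire."""
    alerts = []
    if denial_rate > 0.5:  # more than half of requests denied
        alerts.append("High denial rate")
    if p99_seconds > 2.0:  # P99 above 2 seconds
        alerts.append("High latency")
    return alerts
```

The anomaly and session alerts are left out because their thresholds depend on your traffic and session limits.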
Next Steps
Monitoring: Grafana dashboards and production monitoring setup
Audit & Compliance: structured audit logs complement metrics with per-request detail
Configuration Reference: full [metrics] configuration reference