Monitoring & Metrics¶

Arbiter exposes Prometheus-compatible metrics on the proxy’s /metrics endpoint. These cover request volume, authorization decisions, tool usage, latency, and resource utilization.

Accessing Metrics¶

$ curl http://localhost:8080/metrics

The response is in Prometheus text exposition format, ready to scrape.

Available Metrics¶

Counters¶

Metric	Labels	Description
`requests_total`	`decision` (allow, deny, escalate)	Total proxied requests by authorization outcome
`tool_calls_total`	`tool`	Total calls per tool name
`anomalies_total`	(none)	Total behavioral anomalies detected

Histograms¶

Metric	Buckets	Description
`request_duration_seconds`	5ms to 10s	End-to-end request duration including upstream
`upstream_duration_seconds`	5ms to 10s	Time spent waiting for the upstream MCP server

Gauges¶

Metric	Description
`active_sessions`	Currently active task sessions
`registered_agents`	Total registered agents

Useful Queries¶

Denial Rate¶

rate(requests_total{decision="deny"}[5m]) / rate(requests_total[5m])

A high denial rate might mean policies are too restrictive, or it might mean an agent is misbehaving. Check the audit log to distinguish.

Most-Called Tools¶

topk(10, rate(tool_calls_total[1h]))

P99 Latency¶

histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))

Upstream vs. Arbiter Overhead¶

histogram_quantile(0.50, rate(request_duration_seconds_bucket[5m]))
-
histogram_quantile(0.50, rate(upstream_duration_seconds_bucket[5m]))

The difference is Arbiter’s middleware overhead, typically under 5ms for the full chain.

Health Check¶

$ curl http://localhost:8080/health
OK

Returns 200 with body OK when the proxy is running and can reach the upstream. Use this for load balancer health checks and readiness probes.

Configuration¶

[metrics]
enabled = true

Set enabled = false to disable the /metrics endpoint if you’re not using Prometheus.

Alerting Suggestions¶

Based on the available metrics, here’s a sensible starting set of alerts:

Alert	Condition	Why
High denial rate	`rate(requests_total{decision="deny"}) > 0.5 * rate(requests_total)`	More than half of requests being denied suggests misconfiguration or attack
Anomaly spike	`rate(anomalies_total[5m]) > 1`	Sustained anomalies mean agents are drifting from declared intent
High latency	`histogram_quantile(0.99, ...) > 2`	P99 above 2 seconds suggests upstream issues
Session exhaustion	`active_sessions` near the per-agent cap	Agents may be hitting session limits

Next Steps¶

Monitoring: Grafana dashboards and production monitoring setup
Audit & Compliance: structured audit logs complement metrics with per-request detail
Configuration Reference: full [metrics] configuration reference