Monitoring & Metrics

Arbiter exposes Prometheus-compatible metrics on the proxy’s /metrics endpoint. These cover request volume, authorization decisions, tool usage, latency, and resource utilization.

Accessing Metrics

$ curl http://localhost:8080/metrics

The response is in Prometheus text exposition format, ready to scrape.
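If you want to inspect the endpoint without a Prometheus server, the text format is simple enough to read directly. The following is a minimal sketch of a parser, not a full implementation of the exposition format: `parse_metrics` is a hypothetical helper, and the sample payload is made up for illustration rather than captured from a real Arbiter instance.

```python
import re

# Matches a sample line such as: requests_total{decision="deny"} 12
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches one label pair such as: decision="deny"
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into {(name, labels): value}.

    labels is a sorted tuple of (key, value) pairs so it can be a dict key.
    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(_LABEL.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples


# Hypothetical scrape output; the values are invented for this example.
SAMPLE = """\
# TYPE requests_total counter
requests_total{decision="allow"} 90
requests_total{decision="deny"} 10
# TYPE active_sessions gauge
active_sessions 3
"""

metrics = parse_metrics(SAMPLE)
print(metrics[("requests_total", (("decision", "deny"),))])  # 10.0
print(metrics[("active_sessions", ())])  # 3.0
```

For anything beyond ad hoc inspection, a maintained client library or Prometheus itself is the safer choice, since this sketch ignores timestamps and escaped label values.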

Available Metrics

Counters

| Metric | Labels | Description |
| --- | --- | --- |
| requests_total | decision (allow, deny, escalate) | Total proxied requests by authorization outcome |
| tool_calls_total | tool | Total calls per tool name |
| anomalies_total | (none) | Total behavioral anomalies detected |

Histograms

| Metric | Buckets | Description |
| --- | --- | --- |
| request_duration_seconds | 5ms to 10s | End-to-end request duration including upstream |
| upstream_duration_seconds | 5ms to 10s | Time spent waiting for the upstream MCP server |

Gauges

| Metric | Description |
| --- | --- |
| active_sessions | Currently active task sessions |
| registered_agents | Total registered agents |

Useful Queries

Denial Rate

sum(rate(requests_total{decision="deny"}[5m])) / sum(rate(requests_total[5m]))

A high denial rate might mean policies are too restrictive, or it might mean an agent is misbehaving. Check the audit log to distinguish.
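To make the arithmetic behind the denial rate concrete, here is a small sketch that computes the same ratio from two scrapes of requests_total. The helper name `denial_ratio` and the counter values are assumptions for illustration; the sketch also assumes the counters did not reset between the two scrapes.

```python
def denial_ratio(earlier, later):
    """Fraction of requests denied between two scrapes.

    earlier and later map the decision label to the cumulative
    requests_total value at each scrape. Returns None when no requests
    arrived in the window, where the ratio is undefined.
    """
    # Counters only grow, so the per-decision increase is a subtraction.
    increase = {d: later[d] - earlier.get(d, 0.0) for d in later}
    total = sum(increase.values())
    if total <= 0:
        return None
    return increase.get("deny", 0.0) / total


# Hypothetical counter values from two scrapes five minutes apart.
earlier = {"allow": 900.0, "deny": 50.0, "escalate": 50.0}
later = {"allow": 980.0, "deny": 65.0, "escalate": 55.0}
print(denial_ratio(earlier, later))  # 0.15
```

The window length cancels out of the ratio, which is why dividing two rates over the same range gives a plain fraction of denied requests.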

Most-Called Tools

topk(10, rate(tool_calls_total[1h]))

P99 Latency

histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))
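Because the result is estimated from bucket counts, its precision depends on the bucket boundaries. The sketch below shows the linear interpolation that histogram_quantile performs inside the bucket containing the target rank. It is a simplified illustration rather than Prometheus's actual implementation, and the bucket bounds and counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound and ending with (math.inf, total_count), mirroring the
    series carried by the `le` label.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no quantile
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # The rank falls in the +Inf bucket (or an empty one):
                # the best available answer is the last finite bound.
                return lower_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for request_duration_seconds_bucket.
buckets = [(0.005, 10), (0.1, 60), (1.0, 95), (10.0, 100), (math.inf, 100)]
print(round(histogram_quantile(0.99, buckets), 3))  # 8.2
```

In this example the 99th percentile lands in the wide 1s to 10s bucket, so the estimate of 8.2 seconds could be far from the true value. Treat P99 figures in the top bucket as coarse.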

Upstream vs. Arbiter Overhead

histogram_quantile(0.50, rate(request_duration_seconds_bucket[5m]))
-
histogram_quantile(0.50, rate(upstream_duration_seconds_bucket[5m]))

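The difference approximates Arbiter’s middleware overhead, typically under 5ms for the full chain. Treat it as an estimate, since subtracting two medians does not give the median of the per-request overhead.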

Health Check

$ curl http://localhost:8080/health
OK

Returns 200 with body OK when the proxy is running and can reach the upstream. Use this for load balancer health checks and readiness probes.
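If your orchestrator needs a script rather than a plain HTTP probe, the check can be wrapped in a few lines. This is a sketch under the assumptions stated above (status 200 and a body of OK mean healthy); `check_health` is a hypothetical helper, and the default URL simply reuses the address from the curl example.

```python
import urllib.error
import urllib.request


def check_health(url="http://localhost:8080/health", timeout=2.0):
    """Return True only for a 200 response whose body is exactly OK."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace").strip()
            return response.status == 200 and body == "OK"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or a malformed URL.
        return False


if __name__ == "__main__":
    # Exit code 0 when healthy, 1 otherwise, as most probe runners expect.
    raise SystemExit(0 if check_health() else 1)
```

A short timeout keeps a hung proxy from stalling the probe itself.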

Configuration

[metrics]
enabled = true

Set enabled = false to disable the /metrics endpoint if you’re not using Prometheus.

Alerting Suggestions

Based on the available metrics, here’s a sensible starting set of alerts:

| Alert | Condition | Why |
| --- | --- | --- |
| High denial rate | sum(rate(requests_total{decision="deny"}[5m])) / sum(rate(requests_total[5m])) > 0.5 | More than half of requests being denied suggests misconfiguration or attack |
| Anomaly spike | rate(anomalies_total[5m]) > 1 | Sustained anomalies mean agents are drifting from declared intent |
| High latency | histogram_quantile(0.99, ...) > 2 | P99 above 2 seconds suggests upstream issues |
| Session exhaustion | active_sessions near the per-agent cap | Agents may be hitting session limits |

Next Steps