# Monitoring & Metrics

Arbiter exposes Prometheus-compatible metrics on the proxy's `/metrics` endpoint. These cover request volume, authorization decisions, tool usage, latency, and resource utilization.

## Accessing Metrics

```bash
$ curl http://localhost:8080/metrics
```

The response is in Prometheus text exposition format, ready to scrape.
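An abbreviated response looks roughly like this (the sample values, and the exact `HELP` strings, are illustrative):

```text
# HELP requests_total Total proxied requests by authorization outcome
# TYPE requests_total counter
requests_total{decision="allow"} 1042
requests_total{decision="deny"} 37
requests_total{decision="escalate"} 3
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.005"} 812
request_duration_seconds_bucket{le="+Inf"} 1082
request_duration_seconds_sum 41.7
request_duration_seconds_count 1082
```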
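If Prometheus itself is doing the scraping, a minimal job entry is enough. The job name and target below are assumptions; point them at wherever your proxy actually listens:

```yaml
scrape_configs:
  - job_name: "arbiter"              # hypothetical job name
    metrics_path: "/metrics"
    static_configs:
      - targets: ["localhost:8080"]  # adjust to the proxy's host:port
```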
## Available Metrics

### Counters

| Metric | Labels | Description |
|--------|--------|-------------|
| `requests_total` | `decision` (allow, deny, escalate) | Total proxied requests by authorization outcome |
| `tool_calls_total` | `tool` | Total calls per tool name |
| `anomalies_total` | (none) | Total behavioral anomalies detected |

### Histograms

| Metric | Buckets | Description |
|--------|---------|-------------|
| `request_duration_seconds` | 5ms to 10s | End-to-end request duration including upstream |
| `upstream_duration_seconds` | 5ms to 10s | Time spent waiting for the upstream MCP server |

### Gauges

| Metric | Description |
|--------|-------------|
| `active_sessions` | Currently active task sessions |
| `registered_agents` | Total registered agents |

## Useful Queries

### Denial Rate
```promql
sum(rate(requests_total{decision="deny"}[5m]))
  / sum(rate(requests_total[5m]))
```

The `sum()` aggregations collapse the `decision` label so the two sides of the division have matching label sets; without them, the deny series would only divide against itself. A high denial rate might mean policies are too restrictive, or it might mean an agent is misbehaving. Check the audit log to distinguish.
### Most-Called Tools

```promql
topk(10, rate(tool_calls_total[1h]))
```

### P99 Latency

```promql
histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))
```

### Upstream vs. Arbiter Overhead

```promql
  histogram_quantile(0.50, rate(request_duration_seconds_bucket[5m]))
- histogram_quantile(0.50, rate(upstream_duration_seconds_bucket[5m]))
```

The difference is Arbiter's middleware overhead, typically under 5ms for the full chain.
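If you watch this number regularly, it may be worth materializing it as a Prometheus recording rule so dashboards don't recompute both quantiles on every refresh. A sketch; the group and rule names are assumptions:

```yaml
groups:
  - name: arbiter-derived          # hypothetical group name
    rules:
      # Middleware overhead at the median, precomputed on each evaluation.
      - record: arbiter:middleware_overhead_seconds:p50
        expr: >
          histogram_quantile(0.50, rate(request_duration_seconds_bucket[5m]))
          - histogram_quantile(0.50, rate(upstream_duration_seconds_bucket[5m]))
```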
## Health Check

```bash
$ curl http://localhost:8080/health
OK
```

Returns 200 with body `OK` when the proxy is running and can reach the upstream. Use this for load balancer health checks and readiness probes.
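For example, if the proxy runs in Kubernetes, a readiness probe against this endpoint might look like the sketch below; the port and timings are assumptions:

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080           # adjust to the proxy's listen port
  periodSeconds: 10
  failureThreshold: 3
```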
## Configuration

```toml
[metrics]
enabled = true
```

Set `enabled = false` to disable the `/metrics` endpoint if you're not using Prometheus.
## Alerting Suggestions

Based on the available metrics, here's a sensible starting set of alerts:

| Alert | Condition | Why |
|-------|-----------|-----|
| High denial rate | `sum(rate(requests_total{decision="deny"}[5m])) > 0.5 * sum(rate(requests_total[5m]))` | More than half of requests being denied suggests misconfiguration or an attack |
| Anomaly spike | `rate(anomalies_total[5m]) > 1` | Sustained anomalies mean agents are drifting from declared intent |
| High latency | `histogram_quantile(0.99, ...) > 2` | P99 above 2 seconds suggests upstream issues |
| Session exhaustion | `active_sessions` near the per-agent cap | Agents may be hitting session limits |
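As a concrete starting point, the first two rows might translate into a Prometheus rules file like this sketch. The alert names, `for` durations, and annotation text are assumptions; the thresholds are the ones from the table:

```yaml
groups:
  - name: arbiter-alerts           # hypothetical group name
    rules:
      # Fires when denials exceed half of all requests for a sustained period.
      - alert: ArbiterHighDenialRate
        expr: >
          sum(rate(requests_total{decision="deny"}[5m]))
          > 0.5 * sum(rate(requests_total[5m]))
        for: 10m
        annotations:
          summary: More than half of proxied requests are being denied

      # Fires on a sustained stream of behavioral anomalies.
      - alert: ArbiterAnomalySpike
        expr: rate(anomalies_total[5m]) > 1
        for: 5m
        annotations:
          summary: Sustained behavioral anomalies detected
```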
## Next Steps

- {doc}`../operating/monitoring`: Grafana dashboards and production monitoring setup
- {doc}`audit`: structured audit logs complement metrics with per-request detail
- {doc}`../reference/configuration`: full `[metrics]` configuration reference