Architecture Summary
Reva consists of six components that share host resources:
                      +-----------+
Microsoft Teams ----> | Reva App  | ----> Ollama (GPU)
                      | FastAPI   |       - Router: llama3.2:3b
                      | Python    |       - Agent:  qwen3:14b
                      +-----+-----+
                            |
                 +----------+----------+
                 |          |          |
            +----+-----+ +--+-----+ +--+-----+
            |PostgreSQL| | Redis  | |  MCP   |
            | pgvector | | cache  | | servers|
            +----------+ +--------+ +--------+
                                 (Release, Jira)
Resource consumption profile:
- GPU (Ollama): The dominant bottleneck. All LLM inference is serialized through the GPU. Every user request requires 2–4 LLM calls (1 router classification + 1–3 agent steps).
- CPU/RAM (Reva App): Lightweight async Python process. Spends most time waiting on GPU and MCP responses.
- CPU/RAM (MCP servers): Two sidecar containers handling API calls to Digital.ai Release and Jira. Median latency 63ms.
- CPU/RAM (PostgreSQL): Conversation history and pgvector embeddings. Sub-millisecond query times. Never a bottleneck in testing.
- CPU/RAM (Redis): Session cache. Sub-millisecond latency. Negligible resource usage.
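To spot-check this profile on a live deployment, sample per-container usage. The container and pod names below are illustrative; match them to your compose file or namespace:

# Docker Compose host (container names assumed; adjust to yours)
docker stats --no-stream reva postgres redis release-mcp jira-mcp

# Kubernetes (requires metrics-server)
kubectl top pods -n reva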
Component Requirements
Minimum resources per component for a single-instance deployment:
| Component | CPU Request | CPU Limit | Memory Request | Memory Limit | Storage | Notes |
|---|---|---|---|---|---|---|
| Reva App | 100m | 1000m | 256Mi | 1Gi | Minimal | Async I/O; ~791 MB RSS observed under load |
| PostgreSQL | 100m | 500m | 256Mi | 512Mi | 10Gi PVC | pgvector index grows with conversation count |
| Redis | 50m | 200m | 64Mi | 128Mi | Ephemeral | Session cache only; data loss is non-critical |
| Release MCP | 50m | 500m | 128Mi | 256Mi | None | Sidecar; REST calls to Release server |
| Jira MCP | 50m | 500m | 128Mi | 256Mi | None | Sidecar; REST calls to Jira instance |
| Ollama | 1000m | — | 2Gi | — | 10–20Gi | Model files stored on disk; GPU is the real resource |
Total minimum (excluding GPU): ~1.5 CPU cores, ~3 GB RAM, ~20 GB storage. The GPU is not listed with CPU/memory limits because it runs on the host (or a dedicated node) and uses VRAM as its primary resource.
GPU Sizing
The GPU is the single most important capacity decision. All other components are cheap relative to the GPU.
| GPU VRAM | Example GPUs | Agent Model | NUM_PARALLEL | Concurrent Users | Response Time (p50) | Viable? |
|---|---|---|---|---|---|---|
| 8 GB | RTX 5060, RTX 4060 | qwen3:8b | 1 | 1 | ~15s (estimated) | Basic use only |
| 12 GB | RTX 5070, RTX 4070 | qwen3:14b (tight) | 1 | 1 | ~25s (estimated) | Marginal; KV cache pressure |
| 16 GB | RTX 5070 Ti, RTX 5080 | qwen3:14b | 1 | 1–2 | 22s (measured) | Tested baseline |
| 24 GB | RTX 5090, A10, L4 | qwen3:14b | 2 | 3–5 | ~12s (projected) | Recommended |
| 48 GB | L40S, A6000 | qwen3:14b + vLLM | Batched | 10+ | ~8s (projected) | Multi-user production |
| 2x GPU | Any combination | Dedicated router + agent | 1 each | 3–5 | ~18s (projected) | Eliminates contention |
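To check which tier a host falls into before deploying, query the GPU directly:

# Total vs. used VRAM on the Ollama host
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv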
Key Findings from Testing (RTX 5070 Ti 16GB)
- Single user: 22s median end-to-end (4 agent steps, 1 tool call).
- Throughput ceiling: 0.05 QPS (constant regardless of concurrency) = ~3 requests/min.
- Linear degradation: Each additional concurrent user adds ~20s to wait time due to GPU queue serialization.
- 5 concurrent users: 40% of requests timed out at the 120s client timeout.
- NUM_PARALLEL=2 on 16GB: 5x slower (110s vs 22s). The qwen3:14b model uses ~9 GB for weights, leaving only ~7 GB for KV cache. Two parallel sequences overflow VRAM and trigger CPU offloading.
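A minimal sketch of enforcing the safe setting on a 16 GB card, assuming Ollama is launched from a shell (for systemd or Docker, set the variable in the unit file or compose environment instead):

# Single parallel sequence: qwen3:14b weights (~9 GB) + KV cache stay within 16 GB
export OLLAMA_NUM_PARALLEL=1
ollama serve &

# After the first request, verify residency: anything other than
# "100% GPU" in the PROCESSOR column means CPU offloading is occurring
ollama ps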
When to Use Which GPU Tier
- 8 GB: Proof of concept or personal use. Must use a smaller model (qwen3:8b). Expect reduced tool-calling accuracy.
- 16 GB: Small team (< 20 users), light usage. Acceptable if most users send only 1–2 queries per day.
- 24 GB: Medium team (20–100 users), moderate usage. Enables NUM_PARALLEL=2 without KV cache overflow. First tier where vLLM becomes viable.
- 48 GB: Large deployment or heavy usage. vLLM continuous batching provides near-linear throughput scaling up to 10+ concurrent users.
Users per Instance
“Concurrent users” is different from “total users.” Most enterprise users interact with Reva a few times per day. The table below maps total Teams users to required infrastructure, based on measured throughput of ~3 requests/min on a 16GB GPU.
| Usage Pattern | Queries/User/Hour | Peak Concurrent (est.) | Max Users (16GB) | Max Users (24GB) | Max Users (48GB) |
|---|---|---|---|---|---|
| Light | < 1 | 1 | 50–100 | 100–200 | 300+ |
| Medium | 1–5 | 2–3 | 20–50 | 50–100 | 150–250 |
| Heavy | 5–20 | 5–10 | 10–20 | 20–50 | 75–150 |
| Power users | 20+ | 10+ | 5–10 | 10–20 | 50–75 |
How to Estimate Your Usage Pattern
- Count your total Reva-eligible users (release managers, ops engineers, etc.).
- Estimate peak hour load: typically 10–20% of users are active in the busiest hour.
- Multiply active users by average queries per hour.
- If peak queries/minute exceeds 3 (16GB) or 6 (24GB), you need a larger GPU or multiple instances.
Example: 40 release managers, medium usage (3 queries/hour each during peak). Peak load = 40 × 0.15 × 3 = 18 queries/hour ≈ 0.3 queries/min, comfortably within 16GB capacity (3 req/min) and consistent with the 20–50 user range for medium usage in the table above. If all 40 users were active at once (e.g., during a major release window), load would rise to 40 × 3 / 60 = 2 queries/min, close to the 16GB ceiling; at that point a 24GB GPU (~6 req/min) or a second instance provides headroom.
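The same estimate as a reusable one-liner (numbers taken from the example above):

# peak queries/min = users x active_fraction x queries/user/hour / 60
awk 'BEGIN { users=40; active=0.15; qph=3;
             printf "%.2f peak queries/min\n", users*active*qph/60 }'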
Scaling Options, in Priority Order
Work through these options top to bottom when response times or timeouts indicate scaling is needed; stop at the first step that restores acceptable latency.
| Priority | Action | Cost | Expected Improvement | When to Use |
|---|---|---|---|---|
| 1 | Upgrade GPU to 24GB | $800–1500 | 2x concurrent capacity | First scaling step from 16GB |
| 2 | Enable NUM_PARALLEL=2 (24GB+) | Free | ~2x throughput | After GPU upgrade |
| 3 | Use a faster/smaller model | Free | 30–50% latency reduction | If accuracy is acceptable with qwen3:8b |
| 4 | Add second GPU for router | $200–500 | Eliminates router/agent contention | If router latency > 2s under load |
| 5 | Switch to vLLM (48GB+) | $3000–6000 (GPU) | 5–10x throughput | High-concurrency deployments |
| 6 | Deploy multiple instances | 2x infra | Linear capacity scaling | When single-GPU scaling is exhausted |
| 7 | Use cloud LLM (Claude/OpenAI) | Per-token cost | Unlimited scaling | See cost comparison below |
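Priority 3 as a sketch: pull the smaller agent model and point Reva at it. The AGENT_MODEL variable name is hypothetical; use whichever configuration key your deployment actually reads:

# qwen3:8b weighs ~5 GB on disk vs ~9 GB for qwen3:14b
ollama pull qwen3:8b

# Hypothetical config key; adjust to your deployment, then restart Reva
export AGENT_MODEL=qwen3:8b
docker compose up -d reva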
K8s Resource Examples
Production-tested resource specifications from the project’s Kubernetes manifests.
Reva Application Pod (includes MCP sidecars)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reva
  namespace: reva
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reva
  template:
    metadata:
      labels:
        app: reva
    spec:
      containers:
        - name: reva
          image: reva:latest
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 1Gi
        - name: release-mcp
          image: xebialabsearlyaccess/dai-release-mcp:25.3.0-beta.926
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
            limits:
              memory: 256Mi
        - name: jira-mcp
          image: ghcr.io/sooperset/mcp-atlassian:0.21.0
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
            limits:
              memory: 256Mi
PostgreSQL StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: reva
spec:
  serviceName: postgres
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: pgvector/pgvector:pg16
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 512Mi
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
Redis
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: reva
spec:
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 128Mi
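To apply and verify the manifests above (file names are illustrative):

kubectl apply -f reva-deployment.yaml -f postgres-statefulset.yaml -f redis-deployment.yaml
kubectl -n reva rollout status deployment/reva
kubectl -n reva get pods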
Total Pod Resource Budget
| Resource | Requests (sum) | Limits (sum) |
|---|---|---|
| CPU | 350m | (no CPU limits set in the manifests) |
| Memory | 832Mi | ~2.1Gi |
| Storage | 10Gi (PostgreSQL PVC) | — |
These are conservative values. For deployments expecting sustained load, consider increasing the Reva app memory limit to 2Gi (observed RSS of 791 MB under load with headroom for spikes).
Cloud LLM Comparison
When local GPU capacity is insufficient, cloud LLM APIs offer unlimited scaling at per-token cost. This comparison assumes the Ollama router model (llama3.2:3b) still runs locally.
Per-Request Cost Estimate
A typical Reva request involves ~4 LLM calls with approximately 4,000 input tokens and 800 output tokens total.
| Provider | Model | Input Cost | Output Cost | Cost/Request | Cost/1000 Requests |
|---|---|---|---|---|---|
| Local (Ollama) | qwen3:14b | $0 | $0 | $0 | $0 (GPU amortization only) |
| Anthropic | Claude Sonnet 4 | $3/M input | $15/M output | ~$0.024 | ~$24 |
| Anthropic | Claude Haiku 3.5 | $0.80/M input | $4/M output | ~$0.006 | ~$6 |
| OpenAI | GPT-4o | $2.50/M input | $10/M output | ~$0.018 | ~$18 |
| OpenAI | GPT-4o-mini | $0.15/M input | $0.60/M output | ~$0.001 | ~$1 |
Break-Even Analysis
GPU amortization cost over 3 years (typical enterprise hardware lifecycle):
| GPU | Purchase Cost | Monthly Amortization | Break-even vs Claude Haiku (requests/month) | Break-even vs GPT-4o-mini (requests/month) |
|---|---|---|---|---|
| RTX 5070 Ti 16GB | ~$800 | ~$22/month | ~3,700 requests | ~22,000 requests |
| RTX 5090 32GB | ~$2,000 | ~$56/month | ~9,300 requests | ~56,000 requests |
| L40S 48GB | ~$6,000 | ~$167/month | ~27,800 requests | ~167,000 requests |
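The same break-even arithmetic as a one-liner, for plugging in your own GPU price and per-request cost (defaults taken from the first row):

# requests/month at which cloud spend equals GPU amortization
awk 'BEGIN { gpu=800; months=36; per_req=0.006;
             printf "%.0f requests/month\n", (gpu/months)/per_req }'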
When Cloud Makes Sense
- Fewer than 100 requests/day and you want to avoid GPU procurement
- Burst capacity needed beyond what local GPU can handle
- Proof of concept or trial deployments
When Local GPU Makes Sense
- More than 100 requests/day sustained
- Data sovereignty requirements (no data leaves the network)
- Predictable, flat monthly cost preferred over variable per-token billing
Monitoring for Capacity
Reva exposes metrics via GET /api/stats (JSON) and GET /api/metrics (Prometheus). These are the capacity-relevant metrics:
| Metric | Source | Warning Threshold | Critical Threshold | Action |
|---|---|---|---|---|
| response_time_p95_s | /api/stats | > 30s | > 60s | GPU upgrade needed |
| response_time_p50_s | /api/stats | > 20s | > 40s | Check for model/config regression |
| requests_per_minute | /api/stats | Approaching 3 (16GB) | Sustained at limit | Scale GPU or add instance |
| active_sessions | /api/stats | > 3 (16GB) | > 5 (16GB) | Users will experience queuing |
| llm.response_time_p50_s | /api/stats | > 25s | > 50s | GPU contention or model swap |
| db_pool_checked_out | /api/stats | > 20 (of 30 max) | > 28 | Increase pool_size |
| error_count | /api/stats | Any increase | > 5% error rate | Investigate logs |
Prometheus Alerting Rules (Example)
groups:
  - name: reva-capacity
    rules:
      - alert: RevaHighLatency
        expr: reva_request_duration_seconds{quantile="0.95"} > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Reva p95 response time exceeds 30s"
      - alert: RevaVeryHighLatency
        expr: reva_request_duration_seconds{quantile="0.95"} > 60
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Reva p95 response time exceeds 60s"
      - alert: RevaHighErrorRate
        expr: rate(reva_requests_total{status="error"}[5m]) / rate(reva_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Reva error rate exceeds 5%"
      - alert: RevaDBPoolExhaustion
        expr: reva_db_pool_checked_out / reva_db_pool_size > 0.9
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Database connection pool > 90% utilized"
Manual Capacity Check
Run this periodically or after scaling changes:
# Quick capacity snapshot
curl -s http://localhost:3978/api/stats | jq '{
response_p50: .request_performance.response_time_p50_s,
response_p95: .request_performance.response_time_p95_s,
rpm: .request_performance.requests_per_minute,
active_sessions: .conversations.active_sessions,
llm_p50: .llm.response_time_p50_s,
db_pool_used: .infrastructure.db_pool_checked_out,
db_pool_max: (.infrastructure.db_pool_size + 20),
process_rss_mb: (.infrastructure.process_rss_bytes / 1048576 | floor)
}'
Storage Planning
PostgreSQL Growth
PostgreSQL stores conversation history, pgvector embeddings, and metadata. Growth depends on usage volume.
| Data Type | Size per Unit | Growth Driver |
|---|---|---|
| Conversation message | ~2 KB | 1 row per user message + 1 row per bot response |
| pgvector embedding | ~6 KB (1536 dimensions, float32) | 1 per conversation turn (for memory retrieval) |
| Session metadata | ~0.5 KB | 1 row per conversation session |
Estimated Monthly Growth
| Usage Level | Messages/Month | Storage Growth/Month | 1-Year Projection |
|---|---|---|---|
| Light (20 users, 2 queries/day) | ~1,200 | ~10 MB | ~120 MB |
| Medium (50 users, 5 queries/day) | ~7,500 | ~60 MB | ~720 MB |
| Heavy (100 users, 10 queries/day) | ~30,000 | ~250 MB | ~3 GB |
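To track actual growth against these estimates, query the live database size (connection parameters are illustrative; current_database() avoids hard-coding the name):

psql -h localhost -U reva -d reva -c \
  "SELECT pg_size_pretty(pg_database_size(current_database()));"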
The default PVC size of 10 Gi is sufficient for all but the heaviest deployments over multiple years. Factor in daily compressed backups (~5–10% of DB size per backup, 30-day retention):
| DB Size | Backup Size (compressed) | 30-Day Retention |
|---|---|---|
| 500 MB | ~50 MB | ~1.5 GB |
| 2 GB | ~200 MB | ~6 GB |
| 5 GB | ~500 MB | ~15 GB |
Ollama Model Storage
Model files are stored on the Ollama host (not in the K8s cluster).
| Model | Disk Size |
|---|---|
| llama3.2:3b (router) | ~2 GB |
| qwen3:14b (agent) | ~9 GB |
| nomic-embed-text (embeddings) | ~0.3 GB |
| Total | ~11 GB |
Allocate 20 GB minimum for Ollama storage to accommodate model updates and additional models.
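To confirm what is actually on disk (the models directory is ~/.ollama/models for a user install, /usr/share/ollama/.ollama/models for the Linux systemd service):

ollama list                  # per-model disk sizes
du -sh ~/.ollama/models      # total usage; adjust the path as noted above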
Docker Log Storage
Log rotation is configured in docker-compose.yml:
| Service | Max Size per File | Max Files | Total Max |
|---|---|---|---|
| Reva | 50 MB | 5 | 250 MB |
| PostgreSQL | 20 MB | 3 | 60 MB |
| Redis | 10 MB | 3 | 30 MB |
| Total | — | — | 340 MB |
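To verify the rotation settings on a running container (container name is illustrative):

docker inspect --format '{{json .HostConfig.LogConfig}}' reva | jq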