Contents
  1. Architecture Summary
  2. Component Requirements
  3. GPU Sizing
  4. Users per Instance
  5. Scaling Decision Tree
  6. K8s Resource Examples
  7. Cloud LLM Comparison
  8. Monitoring for Capacity
  9. Storage Planning

Architecture Summary

Reva consists of six components that share host resources:

                     +-----------+
Microsoft Teams ---->| Reva App  |----> Ollama (GPU)
                     |  FastAPI  |       - Router: llama3.2:3b
                     |  Python   |       - Agent:  qwen3:14b
                     +-----+-----+
                           |
                +----------+----------+
                |          |          |
          +-----+----+ +---+----+ +---+----+
          |PostgreSQL| | Redis  | |  MCP   |
          | pgvector | |  cache | | servers|
          +----------+ +--------+ +--------+
                                 (Release, Jira)

Component Requirements

Minimum resources per component for a single-instance deployment:

Component CPU Request CPU Limit Memory Request Memory Limit Storage Notes
Reva App 100m 1000m 256Mi 1Gi Minimal Async I/O; ~791 MB RSS observed under load
PostgreSQL 100m 500m 256Mi 512Mi 10Gi PVC pgvector index grows with conversation count
Redis 50m 200m 64Mi 128Mi Ephemeral Session cache only; data loss is non-critical
Release MCP 50m 500m 128Mi 256Mi None Sidecar; REST calls to Release server
Jira MCP 50m 500m 128Mi 256Mi None Sidecar; REST calls to Jira instance
Ollama 1000m (no limit) 2Gi (no limit) 10–20Gi Model files stored on disk; GPU is the real resource

Total minimum (excluding GPU): ~1.5 CPU cores, ~3 GB RAM, ~20 GB storage. The GPU is not listed with CPU/memory limits because it runs on the host (or a dedicated node) and uses VRAM as its primary resource.

GPU Sizing

The GPU is the single most important capacity decision. All other components are cheap relative to the GPU.

GPU VRAM Example GPUs Agent Model NUM_PARALLEL Concurrent Users Response Time (p50) Viable?
8 GB RTX 5060, RTX 4060 qwen3:8b 1 1 ~15s (estimated) Basic use only
12 GB RTX 5070, RTX 4070 qwen3:14b (tight) 1 1 ~25s (estimated) Marginal; KV cache pressure
16 GB RTX 5070 Ti, RTX 5080 qwen3:14b 1 1–2 22s (measured) Tested baseline
24 GB RTX 4090, A10, L4 qwen3:14b 2 3–5 ~12s (projected) Recommended
48 GB L40S, A6000 qwen3:14b + vLLM Batched 10+ ~8s (projected) Multi-user production
2x GPU Any combination Dedicated router + agent 1 each 3–5 ~18s (projected) Eliminates contention
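
The NUM_PARALLEL column maps to Ollama's OLLAMA_NUM_PARALLEL environment variable. A minimal sketch of setting it, assuming Ollama runs as a docker-compose service (service name and image tag are illustrative):

services:
  ollama:
    image: ollama/ollama:latest
    environment:
      - OLLAMA_NUM_PARALLEL=2        # parallel request slots per loaded model (24GB+ VRAM only)
      - OLLAMA_MAX_LOADED_MODELS=2   # keep router and agent models resident together
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]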

Key Findings from Testing (RTX 5070 Ti 16GB)

When to Use Which GPU Tier

Users per Instance

“Concurrent users” is different from “total users.” Most enterprise users interact with Reva a few times per day. The table below maps total Teams users to required infrastructure, based on measured throughput of ~3 requests/min on a 16GB GPU.

Usage Pattern Queries/User/Hour Peak Concurrent (est.) Max Users (16GB) Max Users (24GB) Max Users (48GB)
Light < 1 1 50–100 100–200 300+
Medium 1–5 2–3 20–50 50–100 150–250
Heavy 5–20 5–10 10–20 20–50 75–150
Power users 20+ 10+ 5–10 10–20 50–75

How to Estimate Your Usage Pattern

  1. Count your total Reva-eligible users (release managers, ops engineers, etc.).
  2. Estimate peak hour load: typically 10–20% of users are active in the busiest hour.
  3. Multiply active users by average queries per hour.
  4. If peak queries/minute exceeds 3 (16GB) or 6 (24GB), you need a larger GPU or multiple instances.

Example: 40 release managers, medium usage (3 queries/hour each during peak). Peak load = 40 × 0.15 × 3 = 18 queries/hour ≈ 0.3 queries/min, well under the 16GB throughput limit of 3 req/min. At this scale the constraint is burst concurrency rather than throughput: 40 medium users sit near the top of the 16GB range (20–50) in the table above. If bursts routinely queue, move to 24GB; only sustained load above ~6 queries/min justifies a 48GB GPU or multiple instances. The arithmetic is easy to script, as sketched below.
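
A minimal shell sketch of the four steps above (inputs are illustrative; thresholds come from the GPU sizing table):

#!/usr/bin/env bash
# Capacity estimate following the four steps above (inputs are examples).
TOTAL_USERS=40              # step 1: Reva-eligible users
PEAK_ACTIVE_FRACTION=0.15   # step 2: 10-20% of users active in the busiest hour
QUERIES_PER_HOUR=3          # step 3: average queries per active user per hour

awk -v u="$TOTAL_USERS" -v f="$PEAK_ACTIVE_FRACTION" -v q="$QUERIES_PER_HOUR" 'BEGIN {
  qpm = u * f * q / 60      # peak queries per minute
  printf "peak load: %.2f queries/min\n", qpm
  # step 4: compare against measured GPU throughput
  if      (qpm > 6) print "=> exceeds 24GB capacity: 48GB GPU or multiple instances"
  else if (qpm > 3) print "=> exceeds 16GB capacity: move to a 24GB GPU"
  else              print "=> within 16GB throughput; watch burst concurrency"
}'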

Scaling Decision Tree

Use this tree when response times or timeouts indicate scaling is needed:

Is p95 response time > 30s?
├── YES ─ Are you on 16GB VRAM?
│   ├── YES ─ Upgrade to 24GB GPU (first priority)
│   └── NO ─ Are you on 24GB+ VRAM with NUM_PARALLEL=1?
│       ├── YES ─ Enable NUM_PARALLEL=2, retest
│       └── NO ─ Is GPU utilization > 90%?
│           ├── YES ─ Consider vLLM (48GB+) or add a second instance
│           └── NO ─ Check MCP/DB latency — bottleneck may be elsewhere
└── NO ─ Is error rate > 5%?
    ├── YES ─ Check HTTP timeouts (increase client timeout)
    │         Check DB pool exhaustion (increase pool_size)
    │         Check MCP server crashes (review container logs)
    └── NO ─ Current capacity is adequate. Monitor trends.

Scaling Options, in Priority Order

Priority Action Cost Expected Improvement When to Use
1 Upgrade GPU to 24GB $800–1500 2x concurrent capacity First scaling step from 16GB
2 Enable NUM_PARALLEL=2 (24GB+) Free ~2x throughput After GPU upgrade
3 Use a faster/smaller model Free 30–50% latency reduction If accuracy is acceptable with qwen3:8b
4 Add second GPU for router $200–500 Eliminates router/agent contention If router latency > 2s under load
5 Switch to vLLM (48GB+) $3000–6000 (GPU) 5–10x throughput High-concurrency deployments
6 Deploy multiple instances 2x infra Linear capacity scaling When single-GPU scaling is exhausted
7 Use cloud LLM (Claude/OpenAI) Per-token cost Unlimited scaling See cost comparison below

K8s Resource Examples

Production-tested resource specifications from the project’s Kubernetes manifests.

Reva Application Pod (includes MCP sidecars)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: reva
  namespace: reva
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: reva
          image: reva:latest
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 1Gi

        - name: release-mcp
          image: xebialabsearlyaccess/dai-release-mcp:25.3.0-beta.926
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
            limits:
              memory: 256Mi

        - name: jira-mcp
          image: ghcr.io/sooperset/mcp-atlassian:0.21.0
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
            limits:
              memory: 256Mi

PostgreSQL StatefulSet

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: reva
spec:
  template:
    spec:
      containers:
        - name: postgres
          image: pgvector/pgvector:pg16
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 512Mi
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        resources:
          requests:
            storage: 10Gi

Redis

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: reva
spec:
  template:
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 128Mi

Total Pod Resource Budget

Resource Requests (sum) Limits (sum)
CPU 350m (no CPU limit)
Memory 832Mi ~2.1Gi
Storage 10Gi (PostgreSQL PVC)

These are conservative values. For deployments expecting sustained load, consider increasing the Reva app memory limit to 2Gi (observed RSS of 791 MB under load with headroom for spikes).
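
One way to apply that change in place, assuming the container ordering shown in the manifest above (Reva app at index 0):

# Raise the Reva app container's memory limit to 2Gi via a JSON patch
kubectl -n reva patch deployment reva --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"2Gi"}]'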

Cloud LLM Comparison

When local GPU capacity is insufficient, cloud LLM APIs offer unlimited scaling at per-token cost. This comparison assumes the Ollama router model (llama3.2:3b) still runs locally.

Per-Request Cost Estimate

A typical Reva request involves ~4 LLM calls with approximately 4,000 input tokens and 800 output tokens total.

Provider Model Input Cost Output Cost Cost/Request Cost/1000 Requests
Local (Ollama) qwen3:14b $0 $0 $0 $0 (GPU amortization only)
Anthropic Claude Sonnet 4 $3/M input $15/M output ~$0.024 ~$24
Anthropic Claude Haiku 3.5 $0.80/M input $4/M output ~$0.006 ~$6
OpenAI GPT-4o $2.50/M input $10/M output ~$0.018 ~$18
OpenAI GPT-4o-mini $0.15/M input $0.60/M output ~$0.001 ~$1

Break-Even Analysis

GPU amortization cost over 3 years (typical enterprise hardware lifecycle):

GPU Purchase Cost Monthly Amortization Break-even vs Claude Haiku (req/month) Break-even vs GPT-4o-mini (req/month)
RTX 5070 Ti 16GB ~$800 ~$22/month ~3,700 requests ~22,000 requests
RTX 5090 32GB ~$2,000 ~$56/month ~9,300 requests ~56,000 requests
L40S 48GB ~$6,000 ~$167/month ~27,800 requests ~167,000 requests
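
Both tables follow from the same arithmetic. A sketch reproducing the Claude Haiku / RTX 5070 Ti figures (token counts and prices as assumed above):

# Reproduce the per-request and break-even figures above
awk 'BEGIN {
  in_tok = 4000; out_tok = 800           # tokens per Reva request (estimate)
  in_price = 0.80; out_price = 4.00      # Claude Haiku 3.5, $ per million tokens
  per_req = (in_tok * in_price + out_tok * out_price) / 1e6
  printf "cost/request: $%.4f\n", per_req                          # ~$0.006

  gpu_cost = 800; months = 36            # RTX 5070 Ti, 3-year lifecycle
  monthly = gpu_cost / months
  printf "amortization: $%.0f/month\n", monthly                    # ~$22/month
  printf "break-even:   %.0f requests/month\n", monthly / per_req  # table shows ~3,700 using the rounded $0.006
}'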

When Cloud Makes Sense

When Local GPU Makes Sense

Monitoring for Capacity

Reva exposes metrics via GET /api/stats (JSON) and GET /api/metrics (Prometheus). These are the capacity-relevant metrics:

Metric Source Warning Threshold Critical Threshold Action
response_time_p95_s /api/stats > 30s > 60s GPU upgrade needed
response_time_p50_s /api/stats > 20s > 40s Check for model/config regression
requests_per_minute /api/stats Approaching 3 (16GB) Sustained at limit Scale GPU or add instance
active_sessions /api/stats > 3 (16GB) > 5 (16GB) Users will experience queuing
llm.response_time_p50_s /api/stats > 25s > 50s GPU contention or model swap
db_pool_checked_out /api/stats > 20 (of 30 max) > 28 Increase pool_size
error_count /api/stats Any increase > 5% error rate Investigate logs
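
To spot-check that the Prometheus endpoint is exporting the relevant series (metric names are assumed from the alerting rules below):

curl -s http://localhost:3978/api/metrics | grep -E 'reva_(request_duration|requests_total|db_pool)'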

Prometheus Alerting Rules (Example)

groups:
  - name: reva-capacity
    rules:
      - alert: RevaHighLatency
        expr: reva_request_duration_seconds{quantile="0.95"} > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Reva p95 response time exceeds 30s"

      - alert: RevaVeryHighLatency
        expr: reva_request_duration_seconds{quantile="0.95"} > 60
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Reva p95 response time exceeds 60s"

      - alert: RevaHighErrorRate
        expr: rate(reva_requests_total{status="error"}[5m]) / rate(reva_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Reva error rate exceeds 5%"

      - alert: RevaDBPoolExhaustion
        expr: reva_db_pool_checked_out / reva_db_pool_size > 0.9
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Database connection pool > 90% utilized"

Manual Capacity Check

Run this periodically or after scaling changes:

# Quick capacity snapshot
curl -s http://localhost:3978/api/stats | jq '{
  response_p50: .request_performance.response_time_p50_s,
  response_p95: .request_performance.response_time_p95_s,
  rpm: .request_performance.requests_per_minute,
  active_sessions: .conversations.active_sessions,
  llm_p50: .llm.response_time_p50_s,
  db_pool_used: .infrastructure.db_pool_checked_out,
  db_pool_max: (.infrastructure.db_pool_size + 20),   # pool_size plus assumed max overflow of 20
  process_rss_mb: (.infrastructure.process_rss_bytes / 1048576 | floor)
}'
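
For unattended checks, the same endpoint can feed a simple threshold test; a sketch using the warning threshold from the monitoring table (the exit code makes it cron-friendly):

#!/usr/bin/env bash
# Exit non-zero when p95 crosses the 30s warning threshold
P95=$(curl -s http://localhost:3978/api/stats | jq '.request_performance.response_time_p95_s')
if [ "$(echo "$P95 > 30" | bc -l)" -eq 1 ]; then
  echo "WARN: p95 response time ${P95}s exceeds 30s threshold"
  exit 1
fi
echo "OK: p95 response time ${P95}s"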

Storage Planning

PostgreSQL Growth

PostgreSQL stores conversation history, pgvector embeddings, and metadata. Growth depends on usage volume.

Data Type Size per Unit Growth Driver
Conversation message ~2 KB 1 row per user message + 1 row per bot response
pgvector embedding ~6 KB (1536 dimensions, float32) 1 per conversation turn (for memory retrieval)
Session metadata ~0.5 KB 1 row per conversation session

Estimated Monthly Growth

Usage Level Messages/Month Storage Growth/Month 1-Year Projection
Light (20 users, 2 queries/day) ~1,200 ~10 MB ~120 MB
Medium (50 users, 5 queries/day) ~7,500 ~60 MB ~720 MB
Heavy (100 users, 10 queries/day) ~30,000 ~250 MB ~3 GB
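
These projections come directly from the per-unit sizes; a sketch of the medium-tier derivation:

# Derive the medium-tier growth figure from the per-unit sizes above
awk 'BEGIN {
  messages = 50 * 5 * 30       # 50 users x 5 queries/day x 30 days
  per_turn_kb = 2 + 6 + 0.5    # message + embedding + session metadata (KB)
  printf "%d messages/month, ~%.0f MB/month\n", messages, messages * per_turn_kb / 1024
}'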

The default PVC size of 10 Gi is sufficient for all but the heaviest deployments over multiple years. Factor in daily compressed backups (~5–10% of DB size per backup, 30-day retention):

DB Size Backup Size (compressed) 30-Day Retention
500 MB ~50 MB ~1.5 GB
2 GB ~200 MB ~6 GB
5 GB ~500 MB ~15 GB
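
A sketch of a matching nightly backup job (host, database, and user names are assumptions; retention matches the 30-day policy above):

#!/usr/bin/env bash
# Nightly compressed logical backup with 30-day retention
BACKUP_DIR=/var/backups/reva
pg_dump -h localhost -U reva reva | gzip > "$BACKUP_DIR/reva-$(date +%F).sql.gz"
find "$BACKUP_DIR" -name 'reva-*.sql.gz' -mtime +30 -delete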

Ollama Model Storage

Model files are stored on the Ollama host (not in the K8s cluster).

Model Disk Size
llama3.2:3b (router) ~2 GB
qwen3:14b (agent) ~9 GB
nomic-embed-text (embeddings) ~0.3 GB
Total ~11 GB
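
To verify what is actually on disk on the Ollama host:

# List installed models and their on-disk sizes
ollama list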

Allocate 20 GB minimum for Ollama storage to accommodate model updates and additional models.

Docker Log Storage

Log rotation is configured in docker-compose.yml:

Service Max Size per File Max Files Total Max
Reva 50 MB 5 250 MB
PostgreSQL 20 MB 3 60 MB
Redis 10 MB 3 30 MB
Total 340 MB
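
The corresponding docker-compose.yml stanza for the Reva service would look like this (service name assumed; the other services follow the same pattern with their own limits):

services:
  reva:
    logging:
      driver: json-file
      options:
        max-size: "50m"   # max size per rotated log file
        max-file: "5"     # files kept: 5 x 50 MB = 250 MB cap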
