Architecture Summary
Reva consists of six components that share host resources:
                      +-----------+
Microsoft Teams ----> | Reva App  | ----> Ollama (GPU)
                      | FastAPI   |       - Router: llama3.2:3b
                      | Python    |       - Agent:  qwen3:14b
                      +-----+-----+
                            |
                 +----------+----------+
                 |          |          |
            +----+-----+ +--+-----+ +--+-----+
            |PostgreSQL| | Redis  | |  MCP   |
            | pgvector | | cache  | | servers|
            +----------+ +--------+ +--------+
                                 (Release, Jira)
Resource consumption profile:
- GPU (Ollama): The dominant bottleneck. All LLM inference is serialized through the GPU. Every user request requires 2–4 LLM calls (1 router classification + 1–3 agent steps).
- CPU/RAM (Reva App): Lightweight async Python process. Spends most time waiting on GPU and MCP responses.
- CPU/RAM (MCP servers): Two sidecar containers handling API calls to Digital.ai Release and Jira. Median latency 63ms.
- CPU/RAM (PostgreSQL): Conversation history and pgvector embeddings. Sub-millisecond query times. Never a bottleneck in testing.
- CPU/RAM (Redis): Session cache. Sub-millisecond latency. Negligible resource usage.
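To spot-check this profile on a live deployment, sample per-container usage. The container and pod names below are illustrative; match them to your compose file or namespace:

# Docker Compose host (container names assumed; adjust to yours)
docker stats --no-stream reva postgres redis release-mcp jira-mcp

# Kubernetes (requires metrics-server)
kubectl top pods -n reva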
Component Requirements
Minimum resources per component for a single-instance deployment:
| Component | CPU Request | CPU Limit | Memory Request | Memory Limit | Storage | Notes |
|---|---|---|---|---|---|---|
| Reva App | 100m | 1000m | 256Mi | 1Gi | Minimal | Async I/O; ~791 MB RSS observed under load |
| PostgreSQL | 100m | 500m | 256Mi | 512Mi | 10Gi PVC | pgvector index grows with conversation count |
| Redis | 50m | 200m | 64Mi | 128Mi | Ephemeral | Session cache only; data loss is non-critical |
| Release MCP | 50m | 500m | 128Mi | 256Mi | None | Sidecar; REST calls to Release server |
| Jira MCP | 50m | 500m | 128Mi | 256Mi | None | Sidecar; REST calls to Jira instance |
| Ollama | 1000m | — | 2Gi | — | 10–20Gi | Model files stored on disk; GPU is the real resource |
Total minimum (excluding GPU): ~1.5 CPU cores, ~3 GB RAM, ~20 GB storage. The GPU is not listed with CPU/memory limits because it runs on the host (or a dedicated node) and uses VRAM as its primary resource.
GPU Sizing
The GPU is the single most important capacity decision. All other components are cheap relative to the GPU.
| GPU VRAM | Example GPUs | Agent Model | NUM_PARALLEL | Concurrent Users | Response Time (p50) | Viable? |
|---|---|---|---|---|---|---|
| 8 GB | RTX 5060, RTX 4060 | qwen3:8b | 1 | 1 | ~15s (estimated) | Basic use only |
| 12 GB | RTX 5070, RTX 4070 | qwen3:14b (tight) | 1 | 1 | ~25s (estimated) | Marginal; KV cache pressure |
| 16 GB | RTX 5070 Ti, RTX 5080 | qwen3:14b | 1 | 1–2 | 22s (measured) | Tested baseline |
| 24 GB | RTX 5090, A10, L4 | qwen3:14b | 2 | 3–5 | ~12s (projected) | Recommended |
| 48 GB | L40S, A6000 | qwen3:14b + vLLM | Batched | 10+ | ~8s (projected) | Multi-user production |
| 2x GPU | Any combination | Dedicated router + agent | 1 each | 3–5 | ~18s (projected) | Eliminates contention |
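To check which tier a host falls into before deploying, query the GPU directly:

# Total vs. used VRAM on the Ollama host
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv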
Key Findings from Testing (RTX 5070 Ti 16GB)
- Single user: 22s median end-to-end (4 agent steps, 1 tool call).
- Throughput ceiling: 0.05 QPS (constant regardless of concurrency) = ~3 requests/min.
- Linear degradation: Each additional concurrent user adds ~20s to wait time due to GPU queue serialization.
- 5 concurrent users: 40% of requests timed out at the 120s client timeout.
- NUM_PARALLEL=2 on 16GB: 5x slower (110s vs 22s). The qwen3:14b model uses ~9 GB for weights, leaving only ~7 GB for KV cache. Two parallel sequences overflow VRAM and trigger CPU offloading.
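A minimal sketch of enforcing the safe setting on a 16 GB card, assuming Ollama is launched from a shell (for systemd or Docker, set the variable in the unit file or compose environment instead):

# Single parallel sequence: qwen3:14b weights (~9 GB) + KV cache stay within 16 GB
export OLLAMA_NUM_PARALLEL=1
ollama serve &

# After the first request, verify residency: anything other than
# "100% GPU" in the PROCESSOR column means CPU offloading is occurring
ollama ps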
When to Use Which GPU Tier
- 8 GB: Proof of concept or personal use. Must use a smaller model (qwen3:8b). Expect reduced tool-calling accuracy.
- 16 GB: Small team (< 20 users), light usage. Acceptable if most users send only 1–2 queries per day.
- 24 GB: Medium team (20–100 users), moderate usage. Enables NUM_PARALLEL=2 without KV cache overflow. First tier where vLLM becomes viable.
- 48 GB: Large deployment or heavy usage. vLLM continuous batching provides near-linear throughput scaling up to 10+ concurrent users.
Users per Instance
“Concurrent users” is different from “total users.” Most enterprise users interact with Reva a few times per day. The table below maps total Teams users to required infrastructure, based on measured throughput of ~3 requests/min on a 16GB GPU.
| Usage Pattern | Queries/User/Hour | Peak Concurrent (est.) | Max Users (16GB) | Max Users (24GB) | Max Users (48GB) |
|---|---|---|---|---|---|
| Light | < 1 | 1 | 50–100 | 100–200 | 300+ |
| Medium | 1–5 | 2–3 | 20–50 | 50–100 | 150–250 |
| Heavy | 5–20 | 5–10 | 10–20 | 20–50 | 75–150 |
| Power users | 20+ | 10+ | 5–10 | 10–20 | 50–75 |
How to Estimate Your Usage Pattern
- Count your total Reva-eligible users (release managers, ops engineers, etc.).
- Estimate peak hour load: typically 10–20% of users are active in the busiest hour.
- Multiply active users by average queries per hour.
- If peak queries/minute exceeds 3 (16GB) or 6 (24GB), you need a larger GPU or multiple instances.
Example: 40 release managers, medium usage (3 queries/hour each during peak). Peak load = 40 × 0.15 × 3 = 18 queries/hour ≈ 0.3 queries/min, comfortably within 16GB capacity (3 req/min) and consistent with the 20–50 user range for medium usage in the table above. If all 40 users were active at once (e.g., during a major release window), load would rise to 40 × 3 / 60 = 2 queries/min, close to the 16GB ceiling; at that point a 24GB GPU (~6 req/min) or a second instance provides headroom.
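The same estimate as a reusable one-liner (numbers taken from the example above):

# peak queries/min = users x active_fraction x queries/user/hour / 60
awk 'BEGIN { users=40; active=0.15; qph=3;
             printf "%.2f peak queries/min\n", users*active*qph/60 }'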
Scaling Options, in Priority Order
Work through these options top to bottom when response times or timeouts indicate scaling is needed; stop at the first step that restores acceptable latency.
| Priority | Action | Cost | Expected Improvement | When to Use |
|---|---|---|---|---|
| 1 | Upgrade GPU to 24GB | $800–1500 | 2x concurrent capacity | First scaling step from 16GB |
| 2 | Enable NUM_PARALLEL=2 (24GB+) | Free | ~2x throughput | After GPU upgrade |
| 3 | Use a faster/smaller model | Free | 30–50% latency reduction | If accuracy is acceptable with qwen3:8b |
| 4 | Add second GPU for router | $200–500 | Eliminates router/agent contention | If router latency > 2s under load |
| 5 | Switch to vLLM (48GB+) | $3000–6000 (GPU) | 5–10x throughput | High-concurrency deployments |
| 6 | Deploy multiple instances | 2x infra | Linear capacity scaling | When single-GPU scaling is exhausted |
| 7 | Use cloud LLM (Claude/OpenAI) | Per-token cost | Unlimited scaling | See cost comparison below |
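Priority 3 as a sketch: pull the smaller agent model and point Reva at it. The AGENT_MODEL variable name is hypothetical; use whichever configuration key your deployment actually reads:

# qwen3:8b weighs ~5 GB on disk vs ~9 GB for qwen3:14b
ollama pull qwen3:8b

# Hypothetical config key; adjust to your deployment, then restart Reva
export AGENT_MODEL=qwen3:8b
docker compose up -d reva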
K8s Resource Examples
Production-tested resource specifications from the project’s Kubernetes manifests.
Reva Application Pod (includes MCP sidecars)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reva
  namespace: reva
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reva
  template:
    metadata:
      labels:
        app: reva
    spec:
      containers:
        - name: reva
          image: reva:latest
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 1Gi
        - name: release-mcp
          image: xebialabsearlyaccess/dai-release-mcp:25.3.0-beta.926
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
            limits:
              memory: 256Mi
        - name: jira-mcp
          image: ghcr.io/sooperset/mcp-atlassian:0.21.0
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
            limits:
              memory: 256Mi
PostgreSQL StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: reva
spec:
  serviceName: postgres
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: pgvector/pgvector:pg16
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 512Mi
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
Redis
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: reva
spec:
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 128Mi
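To apply and verify the manifests above (file names are illustrative):

kubectl apply -f reva-deployment.yaml -f postgres-statefulset.yaml -f redis-deployment.yaml
kubectl -n reva rollout status deployment/reva
kubectl -n reva get pods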
Total Pod Resource Budget
| Resource | Requests (sum) | Limits (sum) |
|---|---|---|
| CPU | 350m | (no CPU limits set in the manifests) |
| Memory | 832Mi | ~2.1Gi |
| Storage | 10Gi (PostgreSQL PVC) | — |
These are conservative values. For deployments expecting sustained load, consider increasing the Reva app memory limit to 2Gi (observed RSS of 791 MB under load with headroom for spikes).
Cloud LLM Comparison
When local GPU capacity is insufficient, cloud LLM APIs offer unlimited scaling at per-token cost. This comparison assumes the Ollama router model (llama3.2:3b) still runs locally.
Per-Request Cost Estimate
A typical Reva request involves ~4 LLM calls with approximately 4,000 input tokens and 800 output tokens total.
| Provider | Model | Input Cost | Output Cost | Cost/Request | Cost/1000 Requests |
|---|---|---|---|---|---|
| Local (Ollama) | qwen3:14b | $0 | $0 | $0 | $0 (GPU amortization only) |
| Anthropic | Claude Sonnet 4 | $3/M input | $15/M output | ~$0.024 | ~$24 |
| Anthropic | Claude Haiku 3.5 | $0.80/M input | $4/M output | ~$0.006 | ~$6 |
| OpenAI | GPT-4o | $2.50/M input | $10/M output | ~$0.018 | ~$18 |
| OpenAI | GPT-4o-mini | $0.15/M input | $0.60/M output | ~$0.001 | ~$1 |
Break-Even Analysis
GPU amortization cost over 3 years (typical enterprise hardware lifecycle):
| GPU | Purchase Cost | Monthly Amortization | Break-even vs Claude Haiku (requests/month) | Break-even vs GPT-4o-mini (requests/month) |
|---|---|---|---|---|
| RTX 5070 Ti 16GB | ~$800 | ~$22/month | ~3,700 requests | ~22,000 requests |
| RTX 5090 32GB | ~$2,000 | ~$56/month | ~9,300 requests | ~56,000 requests |
| L40S 48GB | ~$6,000 | ~$167/month | ~27,800 requests | ~167,000 requests |
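The same break-even arithmetic as a one-liner, for plugging in your own GPU price and per-request cost (defaults taken from the first row):

# requests/month at which cloud spend equals GPU amortization
awk 'BEGIN { gpu=800; months=36; per_req=0.006;
             printf "%.0f requests/month\n", (gpu/months)/per_req }'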
When Cloud Makes Sense
- Fewer than 100 requests/day and you want to avoid GPU procurement
- Burst capacity needed beyond what local GPU can handle
- Proof of concept or trial deployments
When Local GPU Makes Sense
- More than 100 requests/day sustained
- Data sovereignty requirements (no data leaves the network)
- Predictable, flat monthly cost preferred over variable per-token billing
Monitoring for Capacity
Reva exposes metrics via GET /api/stats (JSON) and GET /api/metrics (Prometheus). These are the capacity-relevant metrics:
| Metric | Source | Warning Threshold | Critical Threshold | Action |
|---|---|---|---|---|
| response_time_p95_s | /api/stats | > 30s | > 60s | GPU upgrade needed |
| response_time_p50_s | /api/stats | > 20s | > 40s | Check for model/config regression |
| requests_per_minute | /api/stats | Approaching 3 (16GB) | Sustained at limit | Scale GPU or add instance |
| active_sessions | /api/stats | > 3 (16GB) | > 5 (16GB) | Users will experience queuing |
| llm.response_time_p50_s | /api/stats | > 25s | > 50s | GPU contention or model swap |
| db_pool_checked_out | /api/stats | > 20 (of 30 max) | > 28 | Increase pool_size |
| error_count | /api/stats | Any increase | > 5% error rate | Investigate logs |
Prometheus Alerting Rules (Example)
groups:
  - name: reva-capacity
    rules:
      - alert: RevaHighLatency
        expr: reva_request_duration_seconds{quantile="0.95"} > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Reva p95 response time exceeds 30s"
      - alert: RevaVeryHighLatency
        expr: reva_request_duration_seconds{quantile="0.95"} > 60
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Reva p95 response time exceeds 60s"
      - alert: RevaHighErrorRate
        expr: rate(reva_requests_total{status="error"}[5m]) / rate(reva_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Reva error rate exceeds 5%"
      - alert: RevaDBPoolExhaustion
        expr: reva_db_pool_checked_out / reva_db_pool_size > 0.9
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Database connection pool > 90% utilized"
Manual Capacity Check
Run this periodically or after scaling changes:
# Quick capacity snapshot
curl -s http://localhost:3978/api/stats | jq '{
response_p50: .request_performance.response_time_p50_s,
response_p95: .request_performance.response_time_p95_s,
rpm: .request_performance.requests_per_minute,
active_sessions: .conversations.active_sessions,
llm_p50: .llm.response_time_p50_s,
db_pool_used: .infrastructure.db_pool_checked_out,
db_pool_max: (.infrastructure.db_pool_size + 20),
process_rss_mb: (.infrastructure.process_rss_bytes / 1048576 | floor)
}'
Storage Planning
PostgreSQL Growth
PostgreSQL stores conversation history, pgvector embeddings, and metadata. Growth depends on usage volume.
| Data Type | Size per Unit | Growth Driver |
|---|---|---|
| Conversation message | ~2 KB | 1 row per user message + 1 row per bot response |
| pgvector embedding | ~6 KB (1536 dimensions, float32) | 1 per conversation turn (for memory retrieval) |
| Session metadata | ~0.5 KB | 1 row per conversation session |
Estimated Monthly Growth
| Usage Level | Messages/Month | Storage Growth/Month | 1-Year Projection |
|---|---|---|---|
| Light (20 users, 2 queries/day) | ~1,200 | ~10 MB | ~120 MB |
| Medium (50 users, 5 queries/day) | ~7,500 | ~60 MB | ~720 MB |
| Heavy (100 users, 10 queries/day) | ~30,000 | ~250 MB | ~3 GB |
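To track actual growth against these estimates, query the live database size (connection parameters are illustrative; current_database() avoids hard-coding the name):

psql -h localhost -U reva -d reva -c \
  "SELECT pg_size_pretty(pg_database_size(current_database()));"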
The default PVC size of 10 Gi is sufficient for all but the heaviest deployments over multiple years. Factor in daily compressed backups (~5–10% of DB size per backup, 30-day retention):
| DB Size | Backup Size (compressed) | 30-Day Retention |
|---|---|---|
| 500 MB | ~50 MB | ~1.5 GB |
| 2 GB | ~200 MB | ~6 GB |
| 5 GB | ~500 MB | ~15 GB |
Ollama Model Storage
Model files are stored on the Ollama host (not in the K8s cluster).
| Model | Disk Size |
|---|---|
| llama3.2:3b (router) | ~2 GB |
| qwen3:14b (agent) | ~9 GB |
| nomic-embed-text (embeddings) | ~0.3 GB |
| Total | ~11 GB |
Allocate 20 GB minimum for Ollama storage to accommodate model updates and additional models.
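To confirm what is actually on disk (the models directory is ~/.ollama/models for a user install, /usr/share/ollama/.ollama/models for the Linux systemd service):

ollama list                  # per-model disk sizes
du -sh ~/.ollama/models      # total usage; adjust the path as noted above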
Docker Log Storage
Log rotation is configured in docker-compose.yml:
| Service | Max Size per File | Max Files | Total Max |
|---|---|---|---|
| Reva | 50 MB | 5 | 250 MB |
| PostgreSQL | 20 MB | 3 | 60 MB |
| Redis | 10 MB | 3 | 30 MB |
| Total | — | — | 340 MB |
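To verify the rotation settings on a running container (container name is illustrative):

docker inspect --format '{{json .HostConfig.LogConfig}}' reva | jq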