Overview
This runbook covers backup, restore, and disaster recovery procedures for Reva. It applies to both Docker Compose and Kubernetes (k3s) deployments.
PostgreSQL is the only stateful component. All other services (Redis, Ollama, MCP servers, the Reva application itself) are stateless and recover automatically on restart.
Recovery Targets
| Metric | Target | Rationale |
|---|---|---|
| RPO (Recovery Point Objective) | 24 hours | Daily backup schedule |
| RTO (Recovery Time Objective) | 30 minutes | Restore + restart + verify |
The RPO of 24 hours reflects the default daily backup schedule. For tighter RPO requirements, increase the backup frequency by adjusting the cron schedule or CronJob interval.
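For example, tightening RPO to roughly 6 hours under Docker Compose only requires a denser cron schedule (illustrative entry; the script path matches the one used in Backup Strategy below):

```
# Every 6 hours instead of daily (illustrative)
0 */6 * * * /home/evdb/roberta/bin/backup-db.sh
```

On Kubernetes, the equivalent change is editing `spec.schedule` in the backup CronJob manifest.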
Backup Strategy
Docker Compose: automated daily backups run via cron on the production server:
0 2 * * * /home/evdb/roberta/bin/backup-db.sh
The script (bin/backup-db.sh) runs pg_dump inside the reva-postgres container, compresses with gzip, and stores the result in the backups/ directory. Backups older than 30 days are automatically deleted.
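A minimal sketch of what such a script does, for orientation. The real `bin/backup-db.sh` may differ in detail; the container name and the `--clean --if-exists` dump flags follow the conventions used elsewhere in this runbook.

```shell
# Sketch of a daily backup routine (the actual bin/backup-db.sh may differ).
# Dumps the reva database from the reva-postgres container, compresses it,
# and prunes archives older than 30 days.
backup_db() {
  dir="${1:-backups}"
  stamp="$(date +%Y-%m-%d_%H%M%S)"
  mkdir -p "$dir"
  # --clean --if-exists makes the dump drop and recreate objects on restore
  docker exec reva-postgres pg_dump -U postgres --clean --if-exists reva \
    | gzip > "$dir/reva_${stamp}.sql.gz"
  # Retention: delete backups older than 30 days
  find "$dir" -name 'reva_*.sql.gz' -mtime +30 -type f -delete
}
```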
Manual backup:
./bin/backup-db.sh
Backup files are named reva_YYYY-MM-DD_HHMMSS.sql.gz and stored in backups/.
Kubernetes: a CronJob (k8s/db-backup-cronjob.yaml) runs daily at 02:00 UTC. It uses the pgvector/pgvector:pg16 image to run pg_dump against the postgres service, compresses with gzip, and writes to the db-backups PVC (5Gi, local-path storage class). Backups older than 30 days are pruned automatically.
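For orientation, the manifest is roughly of this shape. This is a sketch only; the real k8s/db-backup-cronjob.yaml may differ. The secret name is borrowed from the restore example later in this runbook, and the exact backup command is an assumption.

```yaml
# Sketch of a daily backup CronJob (the real manifest may differ).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
  namespace: reva
spec:
  schedule: "0 2 * * *"  # daily at 02:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: db-backup
              image: pgvector/pgvector:pg16
              command: ["sh", "-c"]
              args:
                - |
                  pg_dump -h postgres -U postgres reva \
                    | gzip > /backups/reva_$(date +%Y-%m-%d_%H%M%S).sql.gz
                  find /backups -name 'reva_*.sql.gz' -mtime +30 -delete
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: reva-postgres-admin-password
                      key: password
              volumeMounts:
                - name: backups
                  mountPath: /backups
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: db-backups
```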
Trigger a manual backup:
kubectl create job --from=cronjob/db-backup db-backup-manual -n reva
kubectl logs -n reva job/db-backup-manual -f
Retention Policy
| Environment | Location | Retention | Max Storage |
|---|---|---|---|
| Docker Compose | backups/ on host filesystem | 30 days | Unbounded (host disk) |
| Kubernetes | PVC db-backups | 30 days | 5Gi |
Backup Verification
List Available Backups
Docker Compose:
ls -lht backups/reva_*.sql.gz
Kubernetes:
kubectl run backup-ls --rm -it --restart=Never -n reva \
--image=pgvector/pgvector:pg16 \
--overrides='{
"spec": {
"containers": [{
"name": "backup-ls",
"image": "pgvector/pgvector:pg16",
"command": ["ls", "-lht", "/backups/"],
"volumeMounts": [{"name": "backups", "mountPath": "/backups"}]
}],
"volumes": [{
"name": "backups",
"persistentVolumeClaim": {"claimName": "db-backups"}
}]
}
}'
Test Restore to a Temporary Database
This validates that a backup file is restorable without touching the production database.
# Pick a backup to verify
BACKUP_FILE="backups/reva_2026-03-15_020000.sql.gz"
# Create a temporary database, restore into it, then drop it
docker exec reva-postgres psql -U postgres -c "CREATE DATABASE reva_dr_test;"
gunzip -c "$BACKUP_FILE" | docker exec -i reva-postgres psql -U postgres -d reva_dr_test
docker exec reva-postgres psql -U postgres -d reva_dr_test \
-c "SELECT count(*) FROM reva_subscriptions;"
docker exec reva-postgres psql -U postgres -c "DROP DATABASE reva_dr_test;"
If the SELECT returns a row count without errors, the backup is valid. Run this check periodically to ensure your backups are restorable.
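For the periodic check, the steps above can be wrapped in a single function (e.g. run weekly from cron). A sketch, assuming the same container and table names used above; the scratch database is always dropped, even when a step fails:

```shell
# Restore a backup into a throwaway database and spot-check it.
# Assumes the reva-postgres container and reva_subscriptions table used above.
verify_backup() {
  backup_file="$1"
  status=0
  docker exec reva-postgres psql -U postgres -c "CREATE DATABASE reva_dr_test;" || return 1
  gunzip -c "$backup_file" | docker exec -i reva-postgres psql -U postgres -d reva_dr_test || status=1
  # Spot-check: a row count that errors out if the table is missing
  docker exec reva-postgres psql -U postgres -d reva_dr_test \
    -tA -c "SELECT count(*) FROM reva_subscriptions;" || status=1
  # Always drop the scratch database, even if a step failed
  docker exec reva-postgres psql -U postgres -c "DROP DATABASE reva_dr_test;"
  return $status
}
```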
Restore (Docker Compose)
Prerequisites
- A valid backup file in `backups/` (verify with `ls -lht backups/reva_*.sql.gz`)
- The `reva-postgres` container must be running
Step-by-step
# 1. Choose the backup to restore
BACKUP_FILE="backups/reva_2026-03-15_020000.sql.gz"
# 2. Stop the application (keep the database running)
docker compose stop reva
# 3. Terminate active database connections
docker exec reva-postgres psql -U postgres -c \
"SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'reva' AND pid <> pg_backend_pid();"
# 4. Restore the backup (--clean --if-exists in the dump handles DROP/CREATE)
gunzip -c "$BACKUP_FILE" | docker exec -i reva-postgres psql -U postgres -d reva
# 5. Restart the application
docker compose up -d reva
# 6. Wait for startup and verify health
sleep 10
curl -s http://localhost:3978/api/health | python3 -m json.tool
Expected Health Response
{
"status": "ok",
"adapter_initialized": true,
"db": true,
"mcp": {
"release": {"connected": true},
"jira": {"connected": true}
}
}
If the response shows "db": false, check PostgreSQL logs immediately with docker compose logs postgres. Common causes include corrupted dump files and insufficient disk space.
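The health check in step 6 can be hardened into a retry loop so that normal startup delay is not misread as failure. A sketch; the endpoint and JSON fields match the expected response above, while the retry count and interval are arbitrary choices:

```shell
# Poll /api/health until status is "ok" and db is true, or give up.
# Endpoint and fields follow the expected health response shown above.
wait_healthy() {
  url="${1:-http://localhost:3978/api/health}"
  attempts="${2:-12}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS "$url" 2>/dev/null | python3 -c '
import sys, json
d = json.load(sys.stdin)
sys.exit(0 if d.get("status") == "ok" and d.get("db") else 1)
'; then
      echo "healthy"
      return 0
    fi
    i=$((i + 1))
    sleep 5
  done
  echo "health check failed after $attempts attempts" >&2
  return 1
}
```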
Restore (Kubernetes)
Prerequisites
- A valid backup exists on the `db-backups` PVC
- `kubectl` is configured for the target cluster
Step-by-step
# 1. Scale down the application to stop DB connections
kubectl scale deployment reva -n reva --replicas=0
# 2. Identify the backup to restore
kubectl run backup-ls --rm -it --restart=Never -n reva \
--image=pgvector/pgvector:pg16 \
--overrides='{
"spec": {
"containers": [{
"name": "backup-ls",
"image": "pgvector/pgvector:pg16",
"command": ["ls", "-lht", "/backups/"],
"volumeMounts": [{"name": "backups", "mountPath": "/backups"}]
}],
"volumes": [{
"name": "backups",
"persistentVolumeClaim": {"claimName": "db-backups"}
}]
}
}'
# 3. Restore from backup (replace filename as needed)
BACKUP_NAME="reva_2026-03-15_020000.sql.gz"
kubectl run db-restore --rm -it --restart=Never -n reva \
--image=pgvector/pgvector:pg16 \
--overrides='{
"spec": {
"containers": [{
"name": "db-restore",
"image": "pgvector/pgvector:pg16",
"command": ["sh", "-c", "gunzip -c /backups/'"$BACKUP_NAME"' | PGPASSWORD=$POSTGRES_PASSWORD psql -h postgres -U postgres -d reva"],
"env": [{"name": "POSTGRES_PASSWORD", "valueFrom": {"secretKeyRef": {"name": "reva-postgres-admin-password", "key": "password"}}}],
"volumeMounts": [{"name": "backups", "mountPath": "/backups"}]
}],
"volumes": [{
"name": "backups",
"persistentVolumeClaim": {"claimName": "db-backups"}
}]
}
}'
# 4. Scale the application back up
kubectl scale deployment reva -n reva --replicas=1
# 5. Wait for rollout and verify health
kubectl rollout status deployment/reva -n reva --timeout=120s
kubectl exec -n reva deployment/reva -- curl -s http://localhost:3978/api/health
Do not skip step 1. Scaling down the application ensures no active connections interfere with the restore. Running a restore while the app is connected can lead to partial or corrupted state.
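Before step 3 it can be worth confirming that nothing is still connected. A sketch, assuming postgres runs as the statefulset/postgres referenced elsewhere in this runbook and that the postgres user can authenticate inside the pod without a password (adjust if your setup requires one):

```shell
# Count client connections to the reva database (should be 0 before restoring).
# Assumes statefulset/postgres in namespace reva, as used elsewhere in this runbook.
count_reva_connections() {
  kubectl exec -n reva statefulset/postgres -- psql -U postgres -tA \
    -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'reva' AND pid <> pg_backend_pid();"
}
```

Usage: `[ "$(count_reva_connections)" -eq 0 ] || echo "clients still connected"`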
Component Recovery Matrix
Use this matrix to quickly determine recovery actions per component during an incident.
| Component | Stateful? | Needs Restore? | Recovery Action |
|---|---|---|---|
| PostgreSQL | Yes | Yes | Restore from backup (see sections 4/5) |
| Redis | No (ephemeral cache) | No | Restarts empty; app rebuilds cache on demand |
| Ollama | No (models on disk) | No | Models re-downloaded automatically if missing (ollama pull) |
| MCP servers (Release, Jira) | No | No | Stateless containers; reconnect on restart |
| Reva application | No | No | Stateless FastAPI app; tables auto-created via metadata.create_all() |
Only PostgreSQL requires active backup and restore. All other components are either stateless or store data that can be regenerated automatically.
Failover Checklist
Quick-reference checklist for incident response:
- Assess — Identify which component is down (`curl /api/health`, `docker compose ps`, or `kubectl get pods -n reva`)
- Communicate — Notify stakeholders that the bot is temporarily unavailable
- Check PostgreSQL — If the DB is down, check logs (`docker compose logs postgres` / `kubectl logs -n reva statefulset/postgres`)
- Check recent backup — Confirm the latest backup exists and is not empty (see Backup Verification)
- Restore if needed — Follow the restore procedure for your environment (Docker Compose or Kubernetes)
- Verify health — `curl /api/health` must return `"status": "ok"` with `"db": true`
- Verify MCP connectivity — The health response should show `"connected": true` for release and jira
- Test bot — Send a test message in Teams to confirm end-to-end functionality
- Post-mortem — Document root cause, timeline, and any follow-up actions
Time is critical. Work through this checklist sequentially. Do not skip the communication step — stakeholders should know the bot is being recovered before you begin the restore.
Data Loss Scenarios
Database Corrupted or Lost
Impact: All conversation history, user memories, notification subscriptions, and activity logs are lost.
Recovery: Restore from the most recent backup. Maximum data loss = 24 hours (RPO). The Reva application will auto-create tables on startup if they are missing, so even without a backup the service will start — but with no historical data.
# Docker Compose: full database recreation
docker compose down postgres
docker volume rm reva_postgres-data # Remove corrupted volume
docker compose up -d postgres # Fresh PostgreSQL instance
# Wait for postgres to be ready, then restore:
sleep 5
gunzip -c backups/reva_LATEST.sql.gz | docker exec -i reva-postgres psql -U postgres -d reva
docker compose up -d reva
No Backup Available
Impact: Complete data loss. Historical conversations and memories cannot be recovered.
Recovery: Start fresh. The application creates all required tables on startup. Users will need to re-subscribe to notifications. External systems (Release, Jira) are unaffected since they are the source of truth for release/issue data.
# Docker Compose:
docker compose down
docker volume rm reva_postgres-data
docker compose up -d
# Kubernetes:
kubectl delete pvc postgres-data -n reva
kubectl rollout restart statefulset/postgres -n reva
kubectl rollout restart deployment/reva -n reva
Redis Lost or Corrupted
Impact: Minimal. Redis is used as an ephemeral cache. No data loss occurs.
# Docker Compose:
docker compose restart redis
# Kubernetes:
kubectl rollout restart deployment/redis -n reva
MCP Servers Unreachable
Impact: Bot cannot interact with Digital.ai Release or Jira. Conversations and memory continue to work.
Recovery: MCP servers reconnect automatically when the upstream service becomes available. Check network connectivity and upstream service health.
# Check MCP status
curl -s http://localhost:3978/api/health | python3 -c \
"import sys,json; d=json.load(sys.stdin); print(json.dumps(d.get('mcp',{}), indent=2))"
# Docker Compose:
docker compose restart release-mcp jira-mcp
# Kubernetes: MCP sidecars are part of the reva pod
kubectl rollout restart deployment/reva -n reva
Ollama Unavailable
Impact: Bot cannot process messages (LLM inference offline). No data loss.
Recovery: Restart Ollama. Models are stored on disk and persist across restarts. If the model files are lost, they will be re-downloaded on first use.
# Docker Compose:
docker compose restart ollama
# Native install (systemd):
sudo systemctl restart ollama
# Verify model availability
curl -s http://localhost:11434/api/tags | python3 -c \
"import sys,json; [print(m['name']) for m in json.load(sys.stdin)['models']]"
Testing DR
Schedule
Perform a full DR test quarterly. Record results in the team's incident log.
Test Procedure
- Create a test backup from the current production database.
- Restore to a separate environment (do not test on production).
  # Create a test database alongside production
  docker exec reva-postgres psql -U postgres -c "CREATE DATABASE reva_dr_test;"
  gunzip -c backups/reva_LATEST.sql.gz | docker exec -i reva-postgres psql -U postgres -d reva_dr_test
- Validate data integrity:
  # Check table existence and row counts
  docker exec reva-postgres psql -U postgres -d reva_dr_test -c "\dt"
  docker exec reva-postgres psql -U postgres -d reva_dr_test -c "
    SELECT 'conversations' AS tbl, count(*) FROM conversations
    UNION ALL SELECT 'memories', count(*) FROM memories
    UNION ALL SELECT 'reva_subscriptions', count(*) FROM reva_subscriptions;
  "
- Verify pgvector extension:
  docker exec reva-postgres psql -U postgres -d reva_dr_test \
    -c "SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';"
- Clean up:
  docker exec reva-postgres psql -U postgres -c "DROP DATABASE reva_dr_test;"
- Document results: Record the date, backup file used, whether restore succeeded, any issues encountered, and time taken.
What to Verify
- Backup file decompresses without errors
- `pg_dump` restore completes without errors
- All expected tables exist with reasonable row counts
- pgvector extension is present and functional
- Application starts and passes health check after restore
A passing DR test confirms that your backup strategy works end-to-end. If any step fails, investigate and fix the issue before the next quarterly test.