Overview
This runbook covers backup, restore, and disaster recovery procedures for Reva. It applies to both Docker Compose and Kubernetes (k3s) deployments.
PostgreSQL is the only stateful component. All other services (Redis, Ollama, MCP servers, the Reva application itself) are stateless and recover automatically on restart.
Recovery Targets
| Metric | Target | Rationale |
|---|---|---|
| RPO (Recovery Point Objective) | 24 hours | Daily backup schedule |
| RTO (Recovery Time Objective) | 30 minutes | Restore + restart + verify |
The RPO of 24 hours reflects the default daily backup schedule. For tighter RPO requirements, increase the backup frequency by adjusting the cron schedule or CronJob interval.
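For example, tightening RPO to roughly 6 hours under Docker Compose only requires a denser cron schedule (illustrative entry; the script path matches the one used in Backup Strategy below):

```
# Every 6 hours instead of daily (illustrative)
0 */6 * * * /home/evdb/roberta/bin/backup-db.sh
```

On Kubernetes, the equivalent change is editing `spec.schedule` in the backup CronJob manifest.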
Backup Strategy
Docker Compose: automated daily backups run via cron on the production server:
0 2 * * * /home/evdb/roberta/bin/backup-db.sh
The script (bin/backup-db.sh) runs pg_dump inside the reva-postgres container, compresses with gzip, and stores the result in the backups/ directory. Backups older than 30 days are automatically deleted.
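A minimal sketch of what such a script does, for orientation. The real `bin/backup-db.sh` may differ in detail; the container name and the `--clean --if-exists` dump flags follow the conventions used elsewhere in this runbook.

```shell
# Sketch of a daily backup routine (the actual bin/backup-db.sh may differ).
# Dumps the reva database from the reva-postgres container, compresses it,
# and prunes archives older than 30 days.
backup_db() {
  dir="${1:-backups}"
  stamp="$(date +%Y-%m-%d_%H%M%S)"
  mkdir -p "$dir"
  # --clean --if-exists makes the dump drop and recreate objects on restore
  docker exec reva-postgres pg_dump -U postgres --clean --if-exists reva \
    | gzip > "$dir/reva_${stamp}.sql.gz"
  # Retention: delete backups older than 30 days
  find "$dir" -name 'reva_*.sql.gz' -mtime +30 -type f -delete
}
```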
Manual backup:
./bin/backup-db.sh
Backup files are named reva_YYYY-MM-DD_HHMMSS.sql.gz and stored in backups/.
Kubernetes: a CronJob (k8s/db-backup-cronjob.yaml) runs daily at 02:00 UTC. It uses the pgvector/pgvector:pg16 image to run pg_dump against the postgres service, compresses with gzip, and writes to the db-backups PVC (5Gi, local-path storage class). Backups older than 30 days are pruned automatically.
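For orientation, the manifest is roughly of this shape. This is a sketch only; the real k8s/db-backup-cronjob.yaml may differ. The secret name is borrowed from the restore example later in this runbook, and the exact backup command is an assumption.

```yaml
# Sketch of a daily backup CronJob (the real manifest may differ).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
  namespace: reva
spec:
  schedule: "0 2 * * *"  # daily at 02:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: db-backup
              image: pgvector/pgvector:pg16
              command: ["sh", "-c"]
              args:
                - |
                  pg_dump -h postgres -U postgres reva \
                    | gzip > /backups/reva_$(date +%Y-%m-%d_%H%M%S).sql.gz
                  find /backups -name 'reva_*.sql.gz' -mtime +30 -delete
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: reva-postgres-admin-password
                      key: password
              volumeMounts:
                - name: backups
                  mountPath: /backups
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: db-backups
```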
Trigger a manual backup:
kubectl create job --from=cronjob/db-backup db-backup-manual -n reva
kubectl logs -n reva job/db-backup-manual -f
Retention Policy
| Environment | Location | Retention | Max Storage |
|---|---|---|---|
| Docker Compose | backups/ on host filesystem | 30 days | Unbounded (host disk) |
| Kubernetes | PVC db-backups | 30 days | 5Gi |
Backup Verification
List Available Backups
Docker Compose:
ls -lht backups/reva_*.sql.gz
Kubernetes:
kubectl run backup-ls --rm -it --restart=Never -n reva \
--image=pgvector/pgvector:pg16 \
--overrides='{
"spec": {
"containers": [{
"name": "backup-ls",
"image": "pgvector/pgvector:pg16",
"command": ["ls", "-lht", "/backups/"],
"volumeMounts": [{"name": "backups", "mountPath": "/backups"}]
}],
"volumes": [{
"name": "backups",
"persistentVolumeClaim": {"claimName": "db-backups"}
}]
}
}'
Test Restore to a Temporary Database
This validates that a backup file is restorable without touching the production database.
# Pick a backup to verify
BACKUP_FILE="backups/reva_2026-03-15_020000.sql.gz"
# Create a temporary database, restore into it, then drop it
docker exec reva-postgres psql -U postgres -c "CREATE DATABASE reva_dr_test;"
gunzip -c "$BACKUP_FILE" | docker exec -i reva-postgres psql -U postgres -d reva_dr_test
docker exec reva-postgres psql -U postgres -d reva_dr_test \
-c "SELECT count(*) FROM reva_subscriptions;"
docker exec reva-postgres psql -U postgres -c "DROP DATABASE reva_dr_test;"
If the SELECT returns a row count without errors, the backup is valid. Run this check periodically to ensure your backups are restorable.
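For the periodic check, the steps above can be wrapped in a single function (e.g. run weekly from cron). A sketch, assuming the same container and table names used above; the scratch database is always dropped, even when a step fails:

```shell
# Restore a backup into a throwaway database and spot-check it.
# Assumes the reva-postgres container and reva_subscriptions table used above.
verify_backup() {
  backup_file="$1"
  status=0
  docker exec reva-postgres psql -U postgres -c "CREATE DATABASE reva_dr_test;" || return 1
  gunzip -c "$backup_file" | docker exec -i reva-postgres psql -U postgres -d reva_dr_test || status=1
  # Spot-check: a row count that errors out if the table is missing
  docker exec reva-postgres psql -U postgres -d reva_dr_test \
    -tA -c "SELECT count(*) FROM reva_subscriptions;" || status=1
  # Always drop the scratch database, even if a step failed
  docker exec reva-postgres psql -U postgres -c "DROP DATABASE reva_dr_test;"
  return $status
}
```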
Restore (Docker Compose)
Prerequisites
- A valid backup file in `backups/` (verify with `ls -lht backups/reva_*.sql.gz`)
- The `reva-postgres` container must be running
Step-by-step
# 1. Choose the backup to restore
BACKUP_FILE="backups/reva_2026-03-15_020000.sql.gz"
# 2. Stop the application (keep the database running)
docker compose stop reva
# 3. Terminate active database connections
docker exec reva-postgres psql -U postgres -c \
"SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'reva' AND pid <> pg_backend_pid();"
# 4. Restore the backup (--clean --if-exists in the dump handles DROP/CREATE)
gunzip -c "$BACKUP_FILE" | docker exec -i reva-postgres psql -U postgres -d reva
# 5. Restart the application
docker compose up -d reva
# 6. Wait for startup and verify health
sleep 10
curl -s http://localhost:3978/api/health | python3 -m json.tool
Expected Health Response
{
"status": "ok",
"adapter_initialized": true,
"db": true,
"mcp": {
"release": {"connected": true},
"jira": {"connected": true}
}
}
If the response shows "db": false, check PostgreSQL logs immediately with docker compose logs postgres. Common causes include corrupted dump files and insufficient disk space.
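The health check in step 6 can be hardened into a retry loop so that normal startup delay is not misread as failure. A sketch; the endpoint and JSON fields match the expected response above, while the retry count and interval are arbitrary choices:

```shell
# Poll /api/health until status is "ok" and db is true, or give up.
# Endpoint and fields follow the expected health response shown above.
wait_healthy() {
  url="${1:-http://localhost:3978/api/health}"
  attempts="${2:-12}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS "$url" 2>/dev/null | python3 -c '
import sys, json
d = json.load(sys.stdin)
sys.exit(0 if d.get("status") == "ok" and d.get("db") else 1)
'; then
      echo "healthy"
      return 0
    fi
    i=$((i + 1))
    sleep 5
  done
  echo "health check failed after $attempts attempts" >&2
  return 1
}
```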
Restore (Kubernetes)
Prerequisites
- A valid backup exists on the `db-backups` PVC
- `kubectl` is configured for the target cluster
Step-by-step
# 1. Scale down the application to stop DB connections
kubectl scale deployment reva -n reva --replicas=0
# 2. Identify the backup to restore
kubectl run backup-ls --rm -it --restart=Never -n reva \
--image=pgvector/pgvector:pg16 \
--overrides='{
"spec": {
"containers": [{
"name": "backup-ls",
"image": "pgvector/pgvector:pg16",
"command": ["ls", "-lht", "/backups/"],
"volumeMounts": [{"name": "backups", "mountPath": "/backups"}]
}],
"volumes": [{
"name": "backups",
"persistentVolumeClaim": {"claimName": "db-backups"}
}]
}
}'
# 3. Restore from backup (replace filename as needed)
BACKUP_NAME="reva_2026-03-15_020000.sql.gz"
kubectl run db-restore --rm -it --restart=Never -n reva \
--image=pgvector/pgvector:pg16 \
--overrides='{
"spec": {
"containers": [{
"name": "db-restore",
"image": "pgvector/pgvector:pg16",
"command": ["sh", "-c", "gunzip -c /backups/'"$BACKUP_NAME"' | PGPASSWORD=$POSTGRES_PASSWORD psql -h postgres -U postgres -d reva"],
"env": [{"name": "POSTGRES_PASSWORD", "valueFrom": {"secretKeyRef": {"name": "reva-postgres-admin-password", "key": "password"}}}],
"volumeMounts": [{"name": "backups", "mountPath": "/backups"}]
}],
"volumes": [{
"name": "backups",
"persistentVolumeClaim": {"claimName": "db-backups"}
}]
}
}'
# 4. Scale the application back up
kubectl scale deployment reva -n reva --replicas=1
# 5. Wait for rollout and verify health
kubectl rollout status deployment/reva -n reva --timeout=120s
kubectl exec -n reva deployment/reva -- curl -s http://localhost:3978/api/health
Do not skip step 1. Scaling down the application ensures no active connections interfere with the restore. Running a restore while the app is connected can lead to partial or corrupted state.
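Before step 3 it can be worth confirming that nothing is still connected. A sketch, assuming postgres runs as the statefulset/postgres referenced elsewhere in this runbook and that the postgres user can authenticate inside the pod without a password (adjust if your setup requires one):

```shell
# Count client connections to the reva database (should be 0 before restoring).
# Assumes statefulset/postgres in namespace reva, as used elsewhere in this runbook.
count_reva_connections() {
  kubectl exec -n reva statefulset/postgres -- psql -U postgres -tA \
    -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'reva' AND pid <> pg_backend_pid();"
}
```

Usage: `[ "$(count_reva_connections)" -eq 0 ] || echo "clients still connected"`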
Component Recovery Matrix
Use this matrix to quickly determine recovery actions per component during an incident.
| Component | Stateful? | Needs Restore? | Recovery Action |
|---|---|---|---|
| PostgreSQL | Yes | Yes | Restore from backup (see sections 4/5) |
| Redis | No (ephemeral cache) | No | Restarts empty; app rebuilds cache on demand |
| Ollama | No (models on disk) | No | Models re-downloaded automatically if missing (ollama pull) |
| MCP servers (Release, Jira) | No | No | Stateless containers; reconnect on restart |
| Reva application | No | No | Stateless FastAPI app; tables auto-created via metadata.create_all() |
Only PostgreSQL requires active backup and restore. All other components are either stateless or store data that can be regenerated automatically.
Failover Checklist
Quick-reference checklist for incident response:
- Assess — Identify which component is down (`curl /api/health`, `docker compose ps`, or `kubectl get pods -n reva`)
- Communicate — Notify stakeholders that the bot is temporarily unavailable
- Check PostgreSQL — If the DB is down, check logs (`docker compose logs postgres` / `kubectl logs -n reva statefulset/postgres`)
- Check recent backup — Confirm the latest backup exists and is not empty (see Backup Verification)
- Restore if needed — Follow the restore procedure for your environment (Docker Compose or Kubernetes)
- Verify health — `curl /api/health` must return `"status": "ok"` with `"db": true`
- Verify MCP connectivity — The health response should show `"connected": true` for release and jira
- Test bot — Send a test message in Teams to confirm end-to-end functionality
- Post-mortem — Document root cause, timeline, and any follow-up actions
Time is critical. Work through this checklist sequentially. Do not skip the communication step — stakeholders should know the bot is being recovered before you begin the restore.
Data Loss Scenarios
Database Corrupted or Lost
Impact: All conversation history, user memories, notification subscriptions, and activity logs are lost.
Recovery: Restore from the most recent backup. Maximum data loss = 24 hours (RPO). The Reva application will auto-create tables on startup if they are missing, so even without a backup the service will start — but with no historical data.
# Docker Compose: full database recreation
docker compose down postgres
docker volume rm reva_postgres-data # Remove corrupted volume
docker compose up -d postgres # Fresh PostgreSQL instance
# Wait for postgres to be ready, then restore:
sleep 5
gunzip -c backups/reva_LATEST.sql.gz | docker exec -i reva-postgres psql -U postgres -d reva
docker compose up -d reva
No Backup Available
Impact: Complete data loss. Historical conversations and memories cannot be recovered.
Recovery: Start fresh. The application creates all required tables on startup. Users will need to re-subscribe to notifications. External systems (Release, Jira) are unaffected since they are the source of truth for release/issue data.
# Docker Compose:
docker compose down
docker volume rm reva_postgres-data
docker compose up -d
# Kubernetes:
kubectl delete pvc postgres-data -n reva
kubectl rollout restart statefulset/postgres -n reva
kubectl rollout restart deployment/reva -n reva
Redis Lost or Corrupted
Impact: Minimal. Redis is used as an ephemeral cache. No data loss occurs.
# Docker Compose:
docker compose restart redis
# Kubernetes:
kubectl rollout restart deployment/redis -n reva
MCP Servers Unreachable
Impact: Bot cannot interact with Digital.ai Release or Jira. Conversations and memory continue to work.
Recovery: MCP servers reconnect automatically when the upstream service becomes available. Check network connectivity and upstream service health.
# Check MCP status
curl -s http://localhost:3978/api/health | python3 -c \
"import sys,json; d=json.load(sys.stdin); print(json.dumps(d.get('mcp',{}), indent=2))"
# Docker Compose:
docker compose restart release-mcp jira-mcp
# Kubernetes: MCP sidecars are part of the reva pod
kubectl rollout restart deployment/reva -n reva
Ollama Unavailable
Impact: Bot cannot process messages (LLM inference offline). No data loss.
Recovery: Restart Ollama. Models are stored on disk and persist across restarts. If the model files are lost, they will be re-downloaded on first use.
# Docker Compose:
docker compose restart ollama
# Native install (systemd):
sudo systemctl restart ollama
# Verify model availability
curl -s http://localhost:11434/api/tags | python3 -c \
"import sys,json; [print(m['name']) for m in json.load(sys.stdin)['models']]"
Testing DR
Schedule
Perform a full DR test quarterly. Record results in the team's incident log.
Test Procedure
- Create a test backup from the current production database.
- Restore to a separate environment (do not test on production).
  # Create a test database alongside production
  docker exec reva-postgres psql -U postgres -c "CREATE DATABASE reva_dr_test;"
  gunzip -c backups/reva_LATEST.sql.gz | docker exec -i reva-postgres psql -U postgres -d reva_dr_test
- Validate data integrity:
  # Check table existence and row counts
  docker exec reva-postgres psql -U postgres -d reva_dr_test -c "\dt"
  docker exec reva-postgres psql -U postgres -d reva_dr_test -c "
    SELECT 'conversations' AS tbl, count(*) FROM conversations
    UNION ALL SELECT 'memories', count(*) FROM memories
    UNION ALL SELECT 'reva_subscriptions', count(*) FROM reva_subscriptions;
  "
- Verify pgvector extension:
  docker exec reva-postgres psql -U postgres -d reva_dr_test \
    -c "SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';"
- Clean up:
  docker exec reva-postgres psql -U postgres -c "DROP DATABASE reva_dr_test;"
- Document results: Record the date, backup file used, whether restore succeeded, any issues encountered, and time taken.
What to Verify
- Backup file decompresses without errors
- `pg_dump` restore completes without errors
- All expected tables exist with reasonable row counts
- pgvector extension is present and functional
- Application starts and passes health check after restore
A passing DR test confirms that your backup strategy works end-to-end. If any step fails, investigate and fix the issue before the next quarterly test.