## Health Check
The health endpoint verifies database connectivity, Teams adapter initialization, and MCP server status. Docker and Kubernetes healthchecks point here — the pod is marked unhealthy if any MCP server is disconnected.
```bash
# Local (Docker Compose)
curl -s http://localhost:3978/api/health | python3 -m json.tool

# Production (behind your domain)
curl -s https://your-domain.example.com/api/health | python3 -m json.tool
```
Expected response (all systems operational):
```json
{
  "status": "ok",
  "adapter_initialized": true,
  "db": true,
  "mcp": {
    "release": { "connected": true, "tools": 38, "error": null },
    "jira": { "connected": true, "tools": 29, "error": null }
  }
}
```
| Field | Meaning |
|---|---|
| status | "ok" (HTTP 200) or "degraded" (HTTP 503) |
| adapter_initialized | Teams Bot Framework adapter is ready |
| db | PostgreSQL connection successful |
| mcp.*.connected | MCP server is connected and responding |
| mcp.*.tools | Number of tools registered by the MCP server |
After a deployment, MCP sidecars need 20–30 seconds to connect. The health check returns 503 during this startup window.
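During a rollout, a deployment gate can poll the endpoint and proceed only once status flips back to "ok". A minimal parsing sketch, using the sample response above in place of a live curl call:

```shell
# Sample response (from the expected-response example above);
# in a real gate, replace the here-doc with:
#   response=$(curl -s http://localhost:3978/api/health)
response=$(cat <<'EOF'
{"status": "ok", "adapter_initialized": true, "db": true}
EOF
)

# Extract the top-level status field
status=$(printf '%s' "$response" \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["status"])')

# "ok" means healthy; "degraded" (HTTP 503) means keep waiting
echo "health status: $status"
```

Wrapped in a retry loop with a 20–30 second budget, this covers the MCP sidecar startup window described above.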
## Metrics & Monitoring
Reva exposes metrics in two formats. No authentication is required for metrics endpoints.
### Prometheus Format

```
GET /api/metrics    # text/plain; version=0.0.4
```
Scrape this endpoint with Prometheus, Grafana Agent, or any compatible collector. Key metrics:
| Metric | Type | Description |
|---|---|---|
| reva_request_duration_p50_seconds | Gauge | Message handling time (p50) |
| reva_requests_total | Counter | Total messages by status (success/error) |
| reva_llm_step_duration_p50_seconds | Gauge | LLM inference time (p50) |
| reva_agent_max_steps_aborts_total | Counter | Agent loops that hit the iteration limit |
| reva_mcp_tool_duration_p50_seconds | Gauge | MCP tool call duration by server |
| reva_active_sessions | Gauge | Currently active conversation sessions |
| reva_notification_deliveries_total | Counter | Proactive notification deliveries |
### JSON Format

```
GET /api/stats    # application/json
```
Returns the same data as structured JSON — useful for dashboards, scripts, or ad-hoc inspection.
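Without a full Prometheus stack, a single gauge can be pulled out of the text exposition with a one-line awk filter. A sketch with invented sample values; in practice, feed it `curl -s http://localhost:3978/api/metrics` instead of the here-doc:

```shell
# Invented sample of the Prometheus text exposition
metrics=$(cat <<'EOF'
reva_request_duration_p50_seconds 0.42
reva_requests_total{status="success"} 1305
reva_requests_total{status="error"} 7
reva_active_sessions 3
EOF
)

# Print the value of the p50 request-duration gauge
p50=$(printf '%s\n' "$metrics" \
  | awk '$1 == "reva_request_duration_p50_seconds" { print $2 }')
echo "p50 request duration: ${p50}s"
```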
## Support Bundle
The support bundle collects GDPR-safe diagnostic data for remote troubleshooting. It is designed for customer deployments where X-idra has no SSH access. Two collection methods are available: an API endpoint (when the application is running) and a shell script (when it is not).
### Setup
Set a shared secret to enable the support bundle endpoint:
**Docker Compose.** Add to your .env file:

```bash
REVA_SUPPORT_SECRET=your-secret-here
```

Then restart:

```bash
docker compose up -d reva
```
**Kubernetes.** Patch the ConfigMap and restart:

```bash
kubectl patch configmap reva-env -n reva \
  --type merge -p '{"data":{"REVA_SUPPORT_SECRET":"your-secret-here"}}'
kubectl rollout restart deployment/reva -n reva
```
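Any high-entropy string works as the shared secret; the your-secret-here placeholder above should never ship. One way to generate a strong value, assuming openssl is available:

```shell
# 32 random bytes, hex-encoded: a 64-character secret
SECRET=$(openssl rand -hex 32)
echo "REVA_SUPPORT_SECRET=$SECRET"
```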
### API Endpoint (Application Running)
Collect diagnostics as JSON or a downloadable ZIP archive:
```bash
# JSON response
curl -H "X-Support-Secret: $REVA_SUPPORT_SECRET" \
  https://your-domain.example.com/api/support-bundle

# ZIP archive (one file per section); quote the URL so the shell
# does not treat "?" as a glob character
curl -H "X-Support-Secret: $REVA_SUPPORT_SECRET" \
  "https://your-domain.example.com/api/support-bundle?format=zip" \
  -o support-bundle.zip
```
### What It Collects
The bundle runs 11 independent collectors. If one fails, the others still complete.
| Section | Data Collected |
|---|---|
| system_info | OS, kernel, architecture, hostname, RAM, CPU, disk, Python version, installed packages |
| config | Reva + Renfield settings (secrets masked), relevant environment variables |
| health | Database, MCP servers, Teams adapter, Ollama, Redis connectivity |
| metrics | Full metrics snapshot (request performance, agent loop, LLM, MCP, webhooks) |
| mcp_status | MCP server connectivity, tool list per server |
| router_state | Agent router roles, models, descriptions, MCP bindings |
| db_stats | Table row counts, active connections, database size, connection pool stats |
| ollama | Available models, running models with VRAM usage |
| network | DNS resolution + TCP connectivity tests for all configured services |
| logs | Last 500 log lines + last 200 error/warning lines (sanitized) |
| error_summary | Error counts, agent aborts, MCP failures, notification failures |
### GDPR Sanitization
The support bundle is designed to be safe for sharing across organizational boundaries:
| Data Type | Treatment |
|---|---|
| Passwords, tokens, API keys | Replaced with `***` |
| Environment variables matching PASSWORD, SECRET, TOKEN, KEY, APP_ID, TENANT | Values replaced with `***` |
| User names in logs | Replaced with [REDACTED] |
| Session / conversation IDs in logs | Truncated to 8 characters |
| Database content (messages, memories) | Never queried — only pg_stat_user_tables row counts |
**Authentication:** The endpoint returns 403 if REVA_SUPPORT_SECRET is not configured, and 401 if the X-Support-Secret header does not match. Without the correct secret, no diagnostic data is exposed.
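For local testing, the masking rules in the table can be approximated with sed. This is an illustrative sketch only; the sanitize helper and its patterns are not the application's actual sanitizer:

```shell
# Hypothetical approximation of the bundle's masking rules:
#  1. mask values of variables whose names contain PASSWORD/SECRET/TOKEN/KEY/APP_ID/TENANT
#  2. truncate hex session IDs to 8 characters
sanitize() {
  sed -E \
    -e 's/(([A-Z_]*(PASSWORD|SECRET|TOKEN|KEY|APP_ID|TENANT)[A-Z_]*)=)[^[:space:]]+/\1***/g' \
    -e 's/(session_id=)([a-f0-9]{8})[a-f0-9]*/\1\2/g'
}

masked=$(echo 'DB_PASSWORD=hunter2 session_id=0123456789abcdef0123' | sanitize)
echo "$masked"
```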
### Offline Shell Script (Application Not Running)
When the application cannot start or the API is unreachable, use the shell script. It auto-detects whether the deployment uses Docker Compose or Kubernetes.
```bash
# Without API access (collects system info, logs, container status, DB stats)
./bin/support-bundle.sh

# With API access (also pulls the full API bundle)
REVA_SUPPORT_SECRET=your-secret ./bin/support-bundle.sh
```
The script produces a support-bundle-YYYY-MM-DD-HHMMSS.tar.gz archive in the project root directory containing:
- System information (OS, memory, disk, Docker/K8s version)
- Container/pod status and resource usage
- Application logs (last 1000 lines)
- Health check response (if reachable)
- Configuration with secrets masked
- Database statistics (row counts, connections, size)
- Full API bundle (if secret provided and API reachable)
- GDPR_NOTICE.txt documenting what was sanitized
Send the resulting .tar.gz or .zip file to info@x-idra.de for analysis. The bundle contains no personal data, conversation content, or credentials.
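The archive naming convention maps directly onto `date` format strings. A sketch of the assumed naming and packing logic (not the script itself), packing a throwaway directory as a stand-in for the collected sections:

```shell
# support-bundle-YYYY-MM-DD-HHMMSS.tar.gz, as described above
stamp=$(date +%F-%H%M%S)
bundle="support-bundle-${stamp}.tar.gz"

# Throwaway directory standing in for the collected sections
workdir=$(mktemp -d)
echo "example sanitization notice" > "$workdir/GDPR_NOTICE.txt"
tar -czf "$bundle" -C "$workdir" .
echo "created $bundle"
```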
## Database Backups
**Docker Compose**

```bash
# Manual backup
./bin/backup-db.sh

# Automated daily backup (add this line to crontab)
0 2 * * * /path/to/reva/bin/backup-db.sh
```
Backups are saved as gzipped SQL dumps to backups/reva_YYYY-MM-DD_HHMMSS.sql.gz with 30-day automatic retention.
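The 30-day retention presumably comes down to a `find ... -mtime +30 -delete` over the dump directory. A self-contained sketch of that pruning step (the real backup-db.sh may differ; GNU `touch -d` is used here to simulate an old dump):

```shell
backup_dir=$(mktemp -d)

# Simulate one expired and one fresh dump
touch -d '40 days ago' "$backup_dir/reva_2024-01-01_020000.sql.gz"
touch "$backup_dir/reva_2024-02-10_020000.sql.gz"

# Prune dumps older than 30 days
find "$backup_dir" -name 'reva_*.sql.gz' -mtime +30 -delete

ls "$backup_dir"
```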
**Kubernetes**

```bash
# Automated: CronJob runs daily at 02:00 UTC
kubectl get cronjob db-backup -n reva

# Manual backup
kubectl create job --from=cronjob/db-backup db-backup-manual -n reva

# Check backup logs
kubectl logs -n reva job/db-backup-manual
```
Backups are stored on the persistent volume. For off-cluster backup, mount an additional PVC or configure an S3 upload in the CronJob.
Backups use the postgres superuser account (not the restricted reva app user) to ensure all schemas and permissions are captured.
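For the S3 route, one option is an extra container in the CronJob pod that syncs the dump directory after the backup finishes. A hypothetical fragment; the image tag, bucket name, Secret, and volume names are placeholders, not part of the shipped manifests:

```yaml
# Hypothetical second container in the db-backup CronJob pod spec
- name: s3-upload
  image: amazon/aws-cli:2.15.0         # placeholder tag
  command: ["sh", "-c",
            "aws s3 cp /backups s3://your-backup-bucket/reva/ --recursive"]
  envFrom:
    - secretRef:
        name: aws-credentials          # placeholder Secret holding AWS keys
  volumeMounts:
    - name: backup-volume              # the CronJob's existing backup PVC
      mountPath: /backups
```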
## Log Management
**Docker Compose**

```bash
# Follow Reva logs
docker compose logs -f reva

# Search for errors and warnings
docker compose logs reva | grep -iE "error|warning|critical"

# Filter the combined stream for MCP server activity
docker compose logs -f reva 2>&1 | grep -i mcp
```
### Log Rotation
Docker log rotation is pre-configured in docker-compose.yml (json-file driver):
| Service | Max Size | Max Files | Total |
|---|---|---|---|
| reva | 50 MB | 5 | 250 MB |
| postgres | 20 MB | 3 | 60 MB |
| redis | 10 MB | 3 | 30 MB |
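The limits in the table correspond to per-service logging blocks in docker-compose.yml. A sketch of what the reva entry is assumed to look like; check the shipped compose file for the authoritative values:

```yaml
services:
  reva:
    logging:
      driver: json-file
      options:
        max-size: "50m"   # per log file
        max-file: "5"     # 5 files x 50 MB = 250 MB cap
```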
**Kubernetes**

```bash
# Reva application
kubectl logs -n reva -l app=reva -c reva -f

# MCP sidecars
kubectl logs -n reva -l app=reva -c release-mcp -f
kubectl logs -n reva -l app=reva -c jira-mcp -f

# Previous pod logs (after a crash)
kubectl logs -n reva -l app=reva -c reva --previous

# Search for errors and warnings
kubectl logs -n reva -l app=reva -c reva | grep -iE "error|warning"
```
Kubernetes manages log rotation through the container runtime. For long-term log retention, configure a log aggregator (Loki, Elasticsearch, etc.).
## Updates & Rollback
### Update (Docker Compose)

```bash
# 1. Update the version in .env
REVA_VERSION=1.0.5

# 2. Pull the new image and restart
docker compose pull reva
docker compose up -d reva
```
### Rollback (Docker Compose)

```bash
# Revert REVA_VERSION in .env to the previous version
REVA_VERSION=1.0.4
docker compose up -d reva
```
### Update (Kubernetes)

```bash
# Build the new image and import it into k3s
docker build -t reva:latest .
docker save reva:latest | sudo k3s ctr images import -

# Restart the deployment
kubectl rollout restart deployment/reva -n reva

# Watch the rollout
kubectl rollout status deployment/reva -n reva
```
### Rollback (Kubernetes)

```bash
# Roll back to the previous revision
kubectl rollout undo deployment/reva -n reva

# Check rollout history
kubectl rollout history deployment/reva -n reva
```
**Before updating:** Always create a database backup first. The application runs database migrations automatically at startup, and some migrations may not be reversible.