
Monitoring and Health

Sovereign Workflows provides health check endpoints, Prometheus-compatible metrics, structured logging, and audit logging so you can verify that your deployment is healthy and performing well.

Health Checks

Every service in your deployment exposes health check endpoints that tell you whether it is running and ready to handle requests.

Endpoints

Endpoint	Purpose
`/health/live`	Liveness — confirms the service process is running. Use this for container restart decisions.
`/health/ready`	Readiness — confirms the service can handle requests (database connected, dependencies available). Use this for load balancer routing.

The liveness endpoint returns HTTP 200 while the service process is running. The readiness endpoint returns HTTP 200 when all dependencies are available, and HTTP 503 when any dependency is unavailable (for example, a database connection failure).
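The status codes described above can be checked from a small script. The following is a minimal sketch using only the Python standard library; the function names and the three state labels (`down`, `not-ready`, `ready`) are illustrative choices, not part of the product, and the base URL you pass in depends on your deployment.

```python
import urllib.error
import urllib.request
from typing import Optional

# Paths documented above; the base URL is supplied by the caller.
LIVE_PATH = "/health/live"
READY_PATH = "/health/ready"


def fetch_status(url: str, timeout: float = 3.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still the signal we want.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def interpret(live: Optional[int], ready: Optional[int]) -> str:
    """Map the two status codes to a coarse service state (labels are illustrative)."""
    if live != 200:
        return "down"       # process not responding: candidate for a restart
    if ready == 200:
        return "ready"      # safe to route traffic to this instance
    return "not-ready"      # running, but a dependency is unavailable (e.g. 503)


def check_service(base_url: str) -> str:
    """Probe both health endpoints of one service and summarize the result."""
    base = base_url.rstrip("/")
    return interpret(fetch_status(base + LIVE_PATH), fetch_status(base + READY_PATH))
```

Calling `check_service` with the base URL of a service returns one of the three labels. Keeping `interpret` separate from the network call makes the decision logic easy to test without a running deployment.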

Docker Compose

Health checks are pre-configured in the Docker Compose files. Docker automatically monitors each service and marks it as healthy or unhealthy based on these endpoints.

To check the status of all services:

docker compose ps
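If you want to act on this status from a script, `docker compose ps` can emit JSON with `--format json`. The sketch below parses that output and lists services that are not running or not reporting `healthy`. It assumes the JSON objects carry `Service`, `Name`, `State`, and `Health` fields, which may vary between Compose versions, so verify against your own output before relying on it.

```python
import json
import subprocess
from typing import Dict, List


def parse_ps_output(raw: str) -> List[Dict]:
    """Parse `docker compose ps --format json` output.

    Depending on the Compose version, the output is either a single JSON
    array or one JSON object per line, so both shapes are handled.
    """
    raw = raw.strip()
    if not raw:
        return []
    if raw.startswith("["):
        return json.loads(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def unhealthy_services(containers: List[Dict]) -> List[str]:
    """Return services that are not running or whose health is not 'healthy'."""
    bad = []
    for container in containers:
        state = container.get("State", "")
        health = container.get("Health", "")  # empty when no health check is defined
        # Note: a service still in the "starting" state is also reported here.
        if state != "running" or health not in ("", "healthy"):
            bad.append(container.get("Service") or container.get("Name", "<unknown>"))
    return bad


def read_compose_status() -> List[str]:
    """Run docker compose in the current directory and report problem services."""
    result = subprocess.run(
        ["docker", "compose", "ps", "--all", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return unhealthy_services(parse_ps_output(result.stdout))
```

Run `read_compose_status()` from the directory that contains your Compose file; an empty list means every service is running and, where a health check is defined, healthy.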

Kubernetes

If you deploy to Kubernetes, configure liveness and readiness probes that point at the health endpoints. Set `port` to the port your service container listens on:

livenessProbe:
  httpGet:
    path: /health/live
    port: 80
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 80
  initialDelaySeconds: 15
  periodSeconds: 10
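The probes above do not set `failureThreshold` or `timeoutSeconds`, so Kubernetes applies its defaults (3 consecutive failures, 1 second timeout). The sketch below estimates roughly how long a failure can go unnoticed with those settings. It is back-of-the-envelope arithmetic under those assumed defaults, not an exact model of kubelet scheduling.

```python
def worst_case_detection_seconds(period_seconds: int,
                                 failure_threshold: int = 3,
                                 timeout_seconds: int = 1) -> int:
    """Rough upper bound on the time between a failure and Kubernetes acting on it.

    A failure that starts just after a successful probe is seen on the next
    probe, and the action is taken after `failure_threshold` consecutive
    failed probes, the last of which may take up to `timeout_seconds`.
    """
    return period_seconds * failure_threshold + timeout_seconds


# periodSeconds comes from the probes above; the other values are Kubernetes defaults.
print(worst_case_detection_seconds(period_seconds=10))  # 31
```

If roughly 30 seconds is too slow for your environment, lowering `periodSeconds` or `failureThreshold` shortens detection at the cost of more probe traffic and a higher chance of reacting to transient failures.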

Metrics

The Engine API exposes Prometheus-compatible metrics for monitoring workflow throughput, step performance, and system health.

Prometheus Endpoint

Metrics are available at:

GET http://your-engine-host:5003/metrics

What Is Measured

Category	Examples
Orchestration	Steps dispatched, steps completed (by status), fan-out iterations, dispatch latency
HTTP	Request duration, status codes, active requests
Runtime	Garbage collection, thread pool usage, memory
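To inspect these metrics without a Prometheus server, you can read the endpoint directly. The following sketch fetches the text exposition format and parses it into a dictionary. The parser is deliberately simplified (it ignores timestamps and `# HELP` / `# TYPE` lines), and the metric names in the usage note are hypothetical placeholders, since the actual names exposed by the Engine API are not listed here.

```python
import urllib.request
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition into {series: value}.

    The series key is the metric name plus its label set, exactly as written.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The series ends at the last '}' when labels are present,
        # otherwise at the first space.
        if "{" in line:
            end = line.rindex("}") + 1
        else:
            end = line.index(" ")
        series = line[:end]
        value = line[end:].split()[0]  # an optional trailing timestamp is ignored
        samples[series] = float(value)
    return samples


def fetch_metrics(url: str = "http://your-engine-host:5003/metrics") -> Dict[str, float]:
    """Fetch and parse the metrics endpoint (replace the host with your own)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

For example, if the endpoint exposed a counter named `example_steps_completed_total` with a `status` label (a made-up name for illustration), `fetch_metrics()['example_steps_completed_total{status="failed"}']` would return its current value.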

Prometheus Configuration

Add the Engine API as a scrape target in your Prometheus configuration:

scrape_configs:
  - job_name: 'sovereign-engine'
    scrape_interval: 15s
    static_configs:
      - targets: ['engine:5003']
    metrics_path: '/metrics'

Grafana Dashboards

Create Grafana dashboards for: active executions over time, step completion rate by status, P95 dispatch latency, error rate trends, and service health status. These give you at-a-glance visibility into your deployment.

Logging

All services write structured logs to the console by default, so any log aggregation system that collects container output (ELK, Datadog, CloudWatch, etc.) can ingest them.

Adjusting Log Levels

Control log verbosity via environment variables:

Logging__LogLevel__Default=Information
Logging__LogLevel__Microsoft.AspNetCore=Warning

Available levels: Trace, Debug, Information, Warning, Error, Critical.

For troubleshooting, temporarily raise log verbosity by lowering the default level to Debug:

Logging__LogLevel__Default=Debug
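The double underscores in these variable names act as hierarchy separators. This is the standard .NET configuration convention, and the `Microsoft.AspNetCore` category suggests the services follow it, though that is an inference rather than something stated here. The sketch below illustrates the mapping by turning such variables into a nested structure; `env_to_config` is an illustrative helper, not part of the product.

```python
from typing import Dict


def env_to_config(env: Dict[str, str]) -> dict:
    """Build a nested dict from variable names that use '__' as a level separator."""
    config: dict = {}
    for name, value in env.items():
        parts = name.split("__")
        node = config
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # descend, creating levels as needed
        node[parts[-1]] = value
    return config


settings = env_to_config({
    "Logging__LogLevel__Default": "Information",
    "Logging__LogLevel__Microsoft.AspNetCore": "Warning",
})
print(settings["Logging"]["LogLevel"]["Default"])  # Information
```

Read this way, `Logging__LogLevel__Microsoft.AspNetCore=Warning` sets the level for the `Microsoft.AspNetCore` log category only, while `Default` applies to every category without a more specific entry.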

What to Watch For

Log Level	Events
Information	Workflow started, step dispatched, step completed, license validated
Warning	Step execution timeout, rate limit exceeded, retry attempt
Error	Database connection failure, unhandled exception, license expired

Audit Logging

Enterprise tier deployments include comprehensive audit logging that records security-sensitive operations for compliance and investigation purposes.

What Is Audited

Workflow lifecycle events (created, updated, published, deleted)
Execution events (started, cancelled)
Schedule and configuration changes
Connection operations (created, refreshed, revoked)
Policy changes and administrative actions

Configuration

Audit logging is configured in the Engine API:

Audit__LogToDatabase=true
Audit__LogToSerilog=true
Audit__RetentionDays=90

Audit events are stored in a dedicated database table with configurable retention. They can also be streamed to your log aggregation system via Serilog.

License Required

Audit logging requires a license with audit capabilities enabled. Without such a license, audit events are silently discarded. See Licensing for details.

What to Monitor and Alert On

Condition	Recommended Action
Health check returns 503	Check database connectivity and dependent services
Step failure rate exceeds 10%	Investigate failing actions and external service availability
Execution queue growing	Scale executor workers to increase throughput
Memory usage above 85%	Check for memory leaks or unusually large workloads, or increase container memory limits
Rate limit rejections appearing	Investigate source and consider adjusting limits
License validation warnings	Check license expiry and Portal connectivity
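Several of the conditions in this table can be expressed as simple threshold checks. The sketch below evaluates four of them against a snapshot of values you would collect from the health endpoint and your metrics system. The `Snapshot` fields and alert labels are hypothetical, and "queue growing" is interpreted here as a strictly increasing series of samples, which is one possible definition among several.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    ready_status: int            # HTTP status returned by /health/ready
    steps_completed: int         # steps finished in the observation window
    steps_failed: int            # of those, how many failed
    queue_depths: List[int]      # recent queue-depth samples, oldest first
    memory_used_fraction: float  # container memory in use, 0.0 to 1.0


def evaluate(snapshot: Snapshot) -> List[str]:
    """Return the alert labels triggered by this snapshot, in table order."""
    alerts = []
    if snapshot.ready_status == 503:
        alerts.append("readiness-503")
    if snapshot.steps_completed > 0:
        failure_rate = snapshot.steps_failed / snapshot.steps_completed
        if failure_rate > 0.10:
            alerts.append("step-failure-rate")
    depths = snapshot.queue_depths
    # Assumption: "growing" means every sample is larger than the previous one.
    if len(depths) >= 2 and all(b > a for a, b in zip(depths, depths[1:])):
        alerts.append("queue-growing")
    if snapshot.memory_used_fraction > 0.85:
        alerts.append("memory-high")
    return alerts
```

In practice you would encode the same thresholds as alerting rules in your monitoring system; a script like this is mainly useful for validating the thresholds against historical values before enabling alerts.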

Next Steps

Deployment Guide — infrastructure setup with health checks
Configuration Reference — all logging and metrics settings
Security Overview — audit trail and security monitoring