Sovereign Platform is in pre-launch alpha.
Not yet available to purchase. Sign up for our mailing list for upcoming launch dates.
Knowing your deployment is healthy and performing well is essential. Sovereign Workflows provides health check endpoints, metrics, structured logging, and audit logging to give you full operational visibility.
Every service in your deployment exposes health check endpoints that tell you whether it is running and ready to handle requests.
| Endpoint | Purpose |
|---|---|
| /health/live | Liveness: confirms the service process is running. Use this for container restart decisions. |
| /health/ready | Readiness: confirms the service can handle requests (database connected, dependencies available). Use this for load balancer routing. |
A healthy liveness response returns HTTP 200. A readiness check returns HTTP 200 when all dependencies are available, or HTTP 503 when something is wrong (such as a database connection failure).
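The response codes above can be checked from a script as well as from an orchestrator. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and not part of the product.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=3.0):
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still available.
        return err.code


def describe_health(live_status, ready_status):
    """Map the two probe results onto the states described above."""
    if live_status != 200:
        return "not live: the process is not responding, restart the container"
    if ready_status == 200:
        return "ready: safe to route traffic"
    if ready_status == 503:
        return "live but not ready: a dependency such as the database is unavailable"
    return f"unexpected readiness status {ready_status}"
```

For example, `describe_health(fetch_status(base + "/health/live"), fetch_status(base + "/health/ready"))` summarizes one service, where `base` is the address of that service in your deployment. A connection failure (refused, timed out) surfaces as an `OSError` from `fetch_status` and should be treated as "not live".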
Health checks are pre-configured in the Docker Compose files. Docker automatically monitors each service and marks it as healthy or unhealthy based on these endpoints.
To check the status of all services:
docker compose ps
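For scripted checks, the same information can be read in machine-readable form with `docker compose ps --format json`. The sketch below parses that output and lists services that need attention. The field names `Service`, `Name`, `State`, and `Health` are assumed from recent Docker Compose v2 releases, and the output shape (one JSON array versus one object per line) differs between versions, so verify both against your installation.

```python
import json


def parse_compose_ps(output):
    """Parse the output of `docker compose ps --format json`.

    Accepts either a single JSON array or one JSON object per line.
    """
    text = output.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def services_needing_attention(containers):
    """Return names of services that are not running or report unhealthy."""
    problems = []
    for container in containers:
        state = container.get("State", "")
        # Assumption: Health is empty when no health check is defined.
        health = container.get("Health", "")
        if state != "running" or health == "unhealthy":
            problems.append(container.get("Service") or container.get("Name", "?"))
    return problems
```

The output of `subprocess.run(["docker", "compose", "ps", "--format", "json"], capture_output=True, text=True).stdout` can be passed straight to `parse_compose_ps`.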
If deploying to Kubernetes, configure probes pointing at the health endpoints:
livenessProbe:
  httpGet:
    path: /health/live
    port: 80
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 80
  initialDelaySeconds: 15
  periodSeconds: 10
The Engine API exposes Prometheus-compatible metrics for monitoring workflow throughput, step performance, and system health.
Metrics are available at:
GET http://your-engine-host:5003/metrics
| Category | Examples |
|---|---|
| Orchestration | Steps dispatched, steps completed (by status), fan-out iterations, dispatch latency |
| HTTP | Request duration, status codes, active requests |
| Runtime | Garbage collection, thread pool usage, memory |
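The metrics endpoint returns the Prometheus text exposition format, which is simple enough to inspect without a Prometheus server. The sketch below parses that format into tuples. It is a simplified parser (it does not unescape label values), and the metric names in the usage note are hypothetical, since this page does not list the actual names the Engine API exports.

```python
import re

# One sample line: name, optional {labels}, value, optional timestamp.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<timestamp>-?\d+))?$'
)
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        labels = dict(LABEL_PAIR.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def total_by_label(samples, name, label):
    """Sum the values of one metric, grouped by the value of one label."""
    totals = {}
    for sample_name, labels, value in samples:
        if sample_name == name and label in labels:
            totals[labels[label]] = totals.get(labels[label], 0.0) + value
    return totals
```

Given a hypothetical counter such as `example_steps_completed_total{status="failed"} 3`, calling `total_by_label(samples, "example_steps_completed_total", "status")` yields the per-status totals needed to compute a step failure rate.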
Add the Engine API as a scrape target in your Prometheus configuration:
scrape_configs:
  - job_name: 'sovereign-engine'
    scrape_interval: 15s
    static_configs:
      - targets: ['engine:5003']
    metrics_path: '/metrics'
Grafana Dashboards
Create Grafana dashboards for: active executions over time, step completion rate by status, P95 dispatch latency, error rate trends, and service health status. These give you at-a-glance visibility into your deployment.
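A P95 latency panel is usually derived from histogram buckets rather than raw timings. The sketch below shows the linear-interpolation idea behind such an estimate, in simplified form, similar in spirit to PromQL's `histogram_quantile`. It assumes dispatch latency is exported as a histogram with cumulative buckets, which this page does not confirm; the bucket boundaries in the usage note are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs whose
    largest upper bound is math.inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == math.inf:
                # The quantile falls in the overflow bucket, so report
                # the highest finite boundary instead of infinity.
                return lower_bound
            span = count - lower_count
            fraction = (rank - lower_count) / span if span else 0.0
            # Interpolate linearly inside the matching bucket.
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]` (seconds), the 95th observation falls in the 0.5 to 1.0 bucket, so `histogram_quantile(0.95, buckets)` lands between those two boundaries. The estimate is only as precise as the bucket layout allows.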
All services produce structured log output to the console by default. Logs are compatible with any log aggregation system (ELK, Datadog, CloudWatch, etc.).
Control log verbosity via environment variables:
Logging__LogLevel__Default=Information
Logging__LogLevel__Microsoft.AspNetCore=Warning
Available levels: Trace, Debug, Information, Warning, Error, Critical.
For troubleshooting, temporarily increase the log level:
Logging__LogLevel__Default=Debug
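The double underscores in these variable names follow the .NET configuration convention, where `__` stands for the `:` section separator. The sketch below illustrates how the variables map to configuration keys and how a per-category override takes precedence over `Default`. It assumes the services use standard .NET logging configuration, which the variable names suggest but this page does not state. The prefix matching is an approximation of the real rule selection, and the fallback to Information when no `Default` is set is an assumption.

```python
LEVELS = ["Trace", "Debug", "Information", "Warning", "Error", "Critical"]
PREFIX = "Logging:LogLevel:"


def to_config_keys(env):
    """Translate environment variable names into colon-separated config keys."""
    return {name.replace("__", ":"): value for name, value in env.items()}


def minimum_level(config, category):
    """Return the minimum level for a logger category.

    The longest matching category prefix wins; otherwise Default applies.
    """
    best = config.get(PREFIX + "Default", "Information")  # assumed fallback
    best_length = -1
    for key, level in config.items():
        if not key.startswith(PREFIX):
            continue
        rule = key[len(PREFIX):]
        if rule == "Default":
            continue
        matches = category == rule or category.startswith(rule + ".")
        if matches and len(rule) > best_length:
            best, best_length = level, len(rule)
    return best


def is_enabled(config, category, level):
    """Return True if a message at `level` would be written for `category`."""
    return LEVELS.index(level) >= LEVELS.index(minimum_level(config, category))
```

With the two variables shown earlier, a logger in the `Microsoft.AspNetCore.Routing` category writes only Warning and above, while any other category falls back to the `Default` level.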
| Log Level | Events |
|---|---|
| Information | Workflow started, step dispatched, step completed, license validated |
| Warning | Step execution timeout, rate limit exceeded, retry attempt |
| Error | Database connection failure, unhandled exception, license expired |
Enterprise tier deployments include comprehensive audit logging that records security-sensitive operations for compliance and investigation purposes.
Audit logging is configured in the Engine API:
Audit__LogToDatabase=true
Audit__LogToSerilog=true
Audit__RetentionDays=90
Audit events are stored in a dedicated database table with configurable retention. They can also be streamed to your log aggregation system via Serilog.
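To make the retention window concrete, the sketch below computes which event timestamps fall outside a retention period of a given number of days. It only illustrates the arithmetic; how and when the Engine API actually purges old audit rows is not documented here, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def retention_cutoff(retention_days, now=None):
    """Return the oldest timestamp still inside the retention window."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=retention_days)


def is_past_retention(event_time, retention_days, now=None):
    """Return True if an event is older than the retention window."""
    return event_time < retention_cutoff(retention_days, now)
```

With `Audit__RetentionDays=90`, an event recorded 91 days ago is past retention, while one recorded 89 days ago is still inside the window.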
License Required
Audit logging requires a license with audit capabilities enabled. Without this feature, audit events are silently discarded. See Licensing for details.
| Condition | Recommended Action |
|---|---|
| Health check returns 503 | Check database connectivity and dependent services |
| Step failure rate exceeds 10% | Investigate failing actions and external service availability |
| Execution queue growing | Scale executor workers to increase throughput |
| Memory usage above 85% | Investigate possible memory leaks or oversized workloads, or increase container memory limits |
| Rate limit rejections appearing | Investigate source and consider adjusting limits |
| License validation warnings | Check license expiry and Portal connectivity |
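The numeric conditions in this table can be expressed as a small rule check. The sketch below is one possible shape for such a check, not a built-in feature: the `Snapshot` fields are hypothetical and would need to be filled from your own metrics and health checks, and "queue growing" is interpreted here as a strictly increasing series of recent samples.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical point-in-time view of a deployment."""
    readiness_status: int = 200
    steps_failed: int = 0
    steps_total: int = 0
    memory_used_fraction: float = 0.0
    queue_depths: list = field(default_factory=list)  # oldest sample first
    rate_limit_rejections: int = 0


def recommended_actions(snapshot):
    """Return the recommended actions whose conditions currently hold."""
    actions = []
    if snapshot.readiness_status == 503:
        actions.append("Check database connectivity and dependent services")
    if snapshot.steps_total > 0:
        failure_rate = snapshot.steps_failed / snapshot.steps_total
        if failure_rate > 0.10:
            actions.append("Investigate failing actions and external service availability")
    depths = snapshot.queue_depths
    if len(depths) >= 2 and all(later > earlier for earlier, later in zip(depths, depths[1:])):
        actions.append("Scale executor workers to increase throughput")
    if snapshot.memory_used_fraction > 0.85:
        actions.append("Investigate memory usage or increase container resources")
    if snapshot.rate_limit_rejections > 0:
        actions.append("Investigate source and consider adjusting limits")
    return actions
```

In practice these thresholds would more commonly live in Prometheus alerting rules; the function form is mainly useful for ad hoc scripts and for reasoning about the thresholds.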