
Error Handling and Retries

Things go wrong: APIs time out, services return errors, and data arrives in an unexpected format. Sovereign Workflows is designed to handle these failures gracefully so your automations are resilient and recoverable.

Automatic Retries

When a step fails with a retriable error (like a network timeout or a temporary service outage), the workflow engine automatically retries it.

How Retries Work

The default retry policy uses exponential backoff with jitter:

The first retry happens after a short delay (default: 2 seconds)
Each subsequent retry waits longer (with the default backoff factor of 2.0, the delay doubles each time)
A small random variation (jitter) is added to prevent all retries from hitting the service at the same moment
After the maximum number of attempts (default: 3), the step is marked as failed

This approach gives transient issues time to resolve while avoiding overwhelming a struggling service.
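
The schedule described above can be sketched in a few lines of Python. This is an illustration, not the engine's actual implementation: the function name `retry_delays` and the `jitter_fraction` parameter are hypothetical, and the sketch assumes that the maximum number of attempts includes the first try.

```python
import random


def retry_delays(max_attempts=3, initial_delay=2.0, backoff_factor=2.0,
                 jitter_fraction=0.1, rng=None):
    """Return the wait in seconds before each retry.

    Assumption: max_attempts includes the first try, so a step gets
    max_attempts - 1 retries. jitter_fraction is a hypothetical knob;
    the platform only documents jitter as enabled or disabled.
    """
    rng = rng or random.Random()
    delays = []
    for retry_index in range(max_attempts - 1):
        # Exponential backoff: the base delay grows by backoff_factor each time.
        base = initial_delay * backoff_factor ** retry_index
        # Jitter: shift the delay by up to +/- jitter_fraction of its base value.
        spread = base * jitter_fraction
        delays.append(base + rng.uniform(-spread, spread))
    return delays
```

With `jitter_fraction=0.0` and `max_attempts=4`, the function returns `[2.0, 4.0, 8.0]`. With jitter switched on, each value moves by at most 10% of its base, which is enough to spread out retries from many workflows that failed at the same moment.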

Retry Configuration

The default retry behavior can be configured at the platform level:

Setting	Default	Description
Max attempts	3	Maximum number of attempts before the step is marked as failed
Initial delay	2 seconds	Wait time before the first retry
Backoff factor	2.0	Multiplier for each subsequent delay
Jitter	Enabled	Adds randomness to retry timing

Enterprise Retry Features

The Enterprise tier unlocks advanced retry capabilities: adaptive retry policies that learn from historical failure patterns, circuit breaker policies that stop retrying when a service is consistently down, and custom per-action retry policies.
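
To make the circuit breaker idea concrete, here is a minimal sketch of the general pattern. It is not the Enterprise feature itself: the class, its thresholds, and the cooldown behavior are all assumptions chosen for illustration.

```python
class CircuitBreaker:
    """Generic circuit breaker sketch (hypothetical, not the platform's API)."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def allow_request(self, now):
        """Return True if a call may be attempted at time `now` (in seconds)."""
        if self.opened_at is None:
            return True
        # After the cooldown, allow a trial call to probe the service.
        return now - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        # Any success closes the circuit and resets the failure count.
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self, now):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Too many failures in a row: stop calling the service for a while.
            self.opened_at = now
```

The caller checks `allow_request` before each attempt and reports the outcome afterwards. While the circuit is open, attempts are skipped instead of being retried against a service that is known to be down.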

Which Errors Are Retried

Not all failures trigger retries. The system distinguishes between:

Retriable failures — transient errors like timeouts, rate limits, and temporary service unavailability. These are retried automatically.
Permanent failures — errors like invalid credentials, missing resources, or bad request data. These fail immediately without retrying, because retrying would produce the same result.
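
The distinction can be illustrated with a small retry loop. This is a sketch under stated assumptions: the exception classes and the `run_step` function are hypothetical names, the attempt count is assumed to include the first try, and jitter is left out so the delays are easy to follow.

```python
import time


class RetriableError(Exception):
    """Transient failure such as a timeout or rate limit (hypothetical class)."""


class PermanentError(Exception):
    """Failure that retrying cannot fix, such as invalid credentials (hypothetical class)."""


def run_step(step, max_attempts=3, initial_delay=2.0, backoff_factor=2.0,
             sleep=time.sleep):
    """Call step() until it succeeds, fails permanently, or runs out of attempts."""
    attempt = 1
    while True:
        try:
            return step()
        except PermanentError:
            # Retrying would produce the same result, so fail immediately.
            raise
        except RetriableError:
            if attempt >= max_attempts:
                raise  # Attempts exhausted: the step is treated as failed.
            # Wait longer before each new attempt (jitter omitted in this sketch).
            sleep(initial_delay * backoff_factor ** (attempt - 1))
            attempt += 1
```

The `sleep` parameter exists only so the waiting can be replaced in tests. A step that raises `PermanentError` is called exactly once, while a step that keeps raising `RetriableError` is called `max_attempts` times before the error propagates.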

Failure Edges

When a step fails after exhausting its retries, you have two options:

Let the workflow fail — if there is no failure edge, the entire workflow execution is marked as failed. This is the default behavior and is appropriate when any step failure means the whole process should stop.
Route to an error handler — connect a failure edge from the step to another node. When the step fails, instead of stopping the workflow, execution continues along the failure path. This lets you:
- Send an alert or notification about the failure
- Log the error for later review
- Take a compensating action (like reverting a previous step)
- Continue the workflow with default values
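
The routing behavior of the two options can be sketched as a small graph walk. Everything here is hypothetical: the `nodes` and `edges` shapes, the `run_workflow` function, and the status strings are illustrative assumptions, not the platform's data model. In particular, reporting "completed" after a handled failure is an assumption of this sketch.

```python
def run_workflow(nodes, edges, start):
    """Walk a workflow graph, following failure edges when a node raises.

    nodes maps a node name to a callable that takes the shared context dict.
    edges maps a node name to {"success": next_name, "failure": next_name};
    either key may be missing. Both shapes are hypothetical.
    """
    context = {}
    path = []
    current = start
    while current is not None:
        path.append(current)
        outgoing = edges.get(current, {})
        try:
            context[current] = nodes[current](context)
            current = outgoing.get("success")
        except Exception as error:
            if "failure" not in outgoing:
                # No failure edge: the whole execution is marked as failed.
                return "failed", path
            # Failure edge present: pass the error details to the handler node.
            context["error"] = {"node": current, "message": str(error)}
            current = outgoing["failure"]
    return "completed", path
```

With a failure edge from a failing node to a notification node, the walk returns a path that includes the handler. Without the edge, the same failure ends the run with the status "failed".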

Building Resilient Workflows

For critical workflows, add failure edges to key steps. Even a simple "send an email when this step fails" can save hours of investigation time.

ForEach Failure Policies

When a ForEach node is processing a collection, you can choose what happens when individual iterations fail:

Fail Fast — stop all remaining iterations as soon as one fails. The ForEach node reports a failure. Use this when partial results are not useful (e.g., a batch update that must be all-or-nothing).
Continue — let all iterations run to completion, even if some fail. The ForEach node reports the aggregated results: how many succeeded, how many failed, and how many were cancelled. Use this when partial results are acceptable (e.g., sending notifications to a list of users — one failure should not block the rest).
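
A minimal sketch of the two policies follows. It processes items one at a time for simplicity, whereas a real engine may run iterations in parallel. The `for_each` function, the policy strings, and the choice to count skipped iterations as cancelled under Fail Fast are assumptions made for illustration.

```python
def for_each(items, action, policy="continue"):
    """Run action on each item and summarize the outcomes.

    policy is "fail_fast" or "continue" (hypothetical names for the two
    policies). Iterations run sequentially in this sketch.
    """
    summary = {"succeeded": 0, "failed": 0, "cancelled": 0}
    for index, item in enumerate(items):
        try:
            action(item)
            summary["succeeded"] += 1
        except Exception:
            summary["failed"] += 1
            if policy == "fail_fast":
                # Stop here; the remaining iterations never run.
                summary["cancelled"] = len(items) - index - 1
                break
    return summary
```

For four items where the second one fails, the "continue" policy reports three successes and one failure, while "fail_fast" reports one success, one failure, and two iterations that never ran.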

Viewing Failures

When a workflow execution fails, you can investigate what went wrong:

Open the execution from the Monitoring Executions page
Failed steps are clearly highlighted in the execution detail view
Click on a failed step to see:
- The error message and category
- How many retry attempts were made
- The inputs that were sent to the step
- The timing of each attempt

This information helps you quickly identify whether the issue was transient (a simple re-run would fix it) or systemic (it requires a workflow change).

Cancelling Executions

If a workflow is running and you need to stop it — perhaps you noticed an issue with the data or the workflow is taking too long — you can cancel the execution.

Cancelling an execution:

Marks the execution as Cancelled
Stops dispatching new steps
Does not interrupt steps that are already running (they complete normally)
Cancels any pending approval requests from Human Gate nodes
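
The four effects listed above can be modelled with a small state object. This is a sketch only: the `Execution` class and its fields are hypothetical and do not reflect how the engine stores executions.

```python
from dataclasses import dataclass, field


@dataclass
class Execution:
    """Toy model of a workflow execution (hypothetical fields)."""
    status: str = "Running"
    queued_steps: list = field(default_factory=list)
    running_steps: list = field(default_factory=list)
    pending_approvals: list = field(default_factory=list)
    cancelled_approvals: list = field(default_factory=list)

    def dispatch_next(self):
        """Start the next queued step, unless the execution was cancelled."""
        if self.status == "Cancelled" or not self.queued_steps:
            return None
        step = self.queued_steps.pop(0)
        self.running_steps.append(step)
        return step

    def cancel(self):
        # Mark the execution as Cancelled, which stops further dispatching.
        self.status = "Cancelled"
        # Pending Human Gate approval requests are cancelled.
        self.cancelled_approvals.extend(self.pending_approvals)
        self.pending_approvals.clear()
        # running_steps is deliberately untouched: in-flight steps finish normally.
```

After `cancel()` is called, `dispatch_next()` returns `None` even though steps remain queued, and any step already in `running_steps` is left alone.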

Next Steps

Monitoring Executions — track and investigate workflow runs
Node Types — understand ForEach failure policies
Expressions and Data Flow — handle data from failure edges