Error Management In Orchestrated Workflows — The Case Of Infinitic

Gilles Barbier
Jan 17, 2025

Error handling in distributed systems is inherently complex due to the variety of potential issues that can arise across different services. Typically, this requires teams to implement sophisticated tracing systems to monitor and diagnose errors.

Infinitic simplifies this process by automatically tracking error chains and even allowing for error handling directly within workflows, streamlining the overall management of failures.

Infinitic is an event-driven orchestration framework designed for Java/Kotlin applications. It lets you code workflows as if they ran on a single server, while actually running them in a distributed, event-driven architecture powered by Apache Pulsar.

With the release of version 0.17.0, Infinitic introduces powerful new features that make managing errors in workflows easier than ever:

  • Built-in reliability: no message loss, at-least-once delivery, graceful recovery
  • Automatic retrying of transient failures
  • Automatic propagation of errors to the parent workflow
  • Manual or automatic error management at the workflow level

Introduction to Infinitic and Workflow Orchestration

Infinitic is built around Apache Pulsar, a distributed messaging system that ensures reliable message exchange between components. Tasks in Infinitic are processed by Service Executors, which are responsible for specific domains (such as Inventory, Payment, or UserManagement). Workflows, which represent high-level orchestrations of tasks (such as fulfilling an order), are managed by Workflow Executors.

Both Service Executors and Workflow Executors can be horizontally scaled, meaning that additional instances can be added to handle increased loads, ensuring system robustness. However, as with any distributed system, errors can occur at various stages, whether due to bugs, network issues, overloaded services, or delays in human interactions. This is where Infinitic’s error management capabilities come into play.

General Guarantees with Pulsar

Infinitic relies on Pulsar to ensure a reliable messaging infrastructure, providing the following key guarantees:

  • No message loss: Pulsar ensures that no messages are lost. Multi-region clusters can even be configured to mitigate catastrophic risks, such as the loss of all ledgers in a region.
  • At-least-once delivery: While Pulsar typically delivers each message exactly once, there are rare cases where a message might be consumed more than once.
  • Graceful recovery from crashes: If a consumer (such as a Service or Workflow Executor) crashes due to hardware issues or other failures, the messages being processed will be released and eventually consumed by another instance.

These guarantees form the foundation for how Infinitic ensures that tasks and workflows are reliably processed. But what happens when things go wrong, and a task or workflow fails?

Handling Transient Failures

In any distributed system, transient failures are common. These can include network issues, overloaded services, or temporary API downtimes. Infinitic handles these situations by automatically retrying failed tasks.
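To make the idea of retrying with increasing delays concrete, here is a minimal sketch in plain Java. It does not use Infinitic's API: the `RetrySketch` class and its `backoffMillis` and `runWithRetry` methods are hypothetical names chosen for illustration, and the exponential policy is an assumption rather than Infinitic's documented default.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Infinitic: retries a task with exponential backoff.
final class RetrySketch {

    /** Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt - 1, 20); // cap the exponent
        return Math.min(delay, maxMillis);
    }

    /** Runs the task, retrying up to maxRetries times; rethrows the last failure. */
    static <T> T runWithRetry(Supplier<T> task, int maxRetries, long baseMillis) {
        int attempt = 0;
        while (true) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                attempt++;
                if (attempt > maxRetries) {
                    throw e; // retries exhausted: the failure becomes permanent
                }
                try {
                    Thread.sleep(backoffMillis(attempt, baseMillis, 1_000));
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated transient failure: the first two calls fail, the third succeeds.
        String result = runWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

In an orchestrated setup the delay would not block a thread as `Thread.sleep` does here; the sketch only shows the retry decision and the delay computation.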

These retries are customizable and can be fine-tuned for specific use cases. Infinitic also provides a task ID that can serve as an idempotency key, which is useful when your business requires that a task’s side effects are never applied twice.
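As an illustration of how a task ID can act as an idempotency key, the sketch below deduplicates a side effect by ID. It is plain Java with hypothetical names (`IdempotentPaymentSketch`, `charge`); how the task ID is actually obtained inside an Infinitic service is not shown here, so it is simply passed in as a parameter.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical payment service; taskId stands in for the task ID provided by the framework.
final class IdempotentPaymentSketch {
    private final Map<String, String> receiptsByTaskId = new ConcurrentHashMap<>();
    private final AtomicInteger chargesPerformed = new AtomicInteger();

    /** Charges at most once per taskId; a retried or redelivered task gets the stored receipt. */
    String charge(String taskId, long amountCents) {
        return receiptsByTaskId.computeIfAbsent(taskId, id -> {
            chargesPerformed.incrementAndGet(); // the real side effect would happen here
            return "receipt-" + id + "-" + amountCents;
        });
    }

    int chargesPerformed() {
        return chargesPerformed.get();
    }

    public static void main(String[] args) {
        IdempotentPaymentSketch service = new IdempotentPaymentSketch();
        service.charge("task-1", 500);
        service.charge("task-1", 500); // duplicate delivery of the same task
        System.out.println("charges performed: " + service.chargesPerformed());
    }
}
```

A production version would keep the receipts in durable storage shared by all service instances; the in-memory map is only there to keep the example self-contained.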

However, retries don’t guarantee that a task will eventually succeed. That’s why it’s essential to have a strategy in place for managing task failures within workflows.

Workflow-Level Error Handling

When a Service Executor fails to process a task after several retry attempts, Infinitic sends an error message to the requesting workflow. By default, this will cause the workflow to fail if the task was requested synchronously.

In contrast, if the task was requested asynchronously, the failure will not immediately cause an error in the workflow.

However, if the workflow later attempts to access the result of a failed asynchronous task, an exception will be triggered.
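The difference between the synchronous and asynchronous cases can be sketched with the JDK's `CompletableFuture`, used here only as an analogy for a deferred task result. This is not Infinitic code: `SyncVsAsyncFailureSketch` and `failingTask` are hypothetical names, and the catch blocks exist only to record where the failure surfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

final class SyncVsAsyncFailureSketch {

    // Hypothetical task that has exhausted its retries.
    static String failingTask() {
        throw new IllegalStateException("task failed after retries");
    }

    /** Synchronous request: the failure surfaces at the call itself. */
    static List<String> syncTrace() {
        List<String> trace = new ArrayList<>();
        try {
            failingTask();                 // throws here
            trace.add("other work done");  // never reached
        } catch (IllegalStateException e) {
            trace.add("failed at call: " + e.getMessage());
        }
        return trace;
    }

    /** Asynchronous request: the failure surfaces only when the result is read. */
    static List<String> asyncTrace() {
        List<String> trace = new ArrayList<>();
        CompletableFuture<String> deferred =
                CompletableFuture.supplyAsync(SyncVsAsyncFailureSketch::failingTask);
        trace.add("dispatched");           // no error at dispatch time
        trace.add("other work done");      // the main flow keeps going
        try {
            deferred.join();               // reading the result rethrows the failure
            trace.add("result read");
        } catch (CompletionException e) {
            trace.add("failed on result access: " + e.getCause().getMessage());
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(syncTrace());
        System.out.println(asyncTrace());
    }
}
```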

In some extreme cases, such as when Service Executors are permanently down or unavailable for a prolonged period, the failure message may never be delivered. To handle this, Infinitic allows users to define timeouts at the workflow level. If the result of a task is not received within the specified duration, the workflow will trigger an error.

(Note: The timeout message is managed as a delayed message by Pulsar itself.)
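The timeout behavior can be approximated in plain Java by waiting on a result for a bounded duration. The sketch below is an analogy only: `TimeoutSketch` and `awaitOrFail` are hypothetical names, and it waits on a thread, whereas the article states that Infinitic relies on a Pulsar delayed message.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutSketch {

    /** Waits for the task result; turns a missing result into an error after timeoutMillis. */
    static String awaitOrFail(CompletableFuture<String> pending, long timeoutMillis) {
        try {
            return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("task timed out after " + timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // A future that is never completed simulates a service that never answers.
        CompletableFuture<String> neverAnswered = new CompletableFuture<>();
        try {
            awaitOrFail(neverAnswered, 50);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```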

Manual Error Handling Flow

When a workflow fails due to an exception, the typical flow is:

  1. DevOps awareness: The issue is detected through the system’s logging mechanism.
  2. Issue resolution: DevOps decides how to resolve the underlying issue (e.g., fixing a service bug, scaling up resources).
  3. Retry failed tasks: Once the issue is fixed, DevOps can use the Infinitic client to retry the failed tasks.
  4. Automatic resume: After the task succeeds, the workflow instance automatically resumes from where it left off.
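The retry-then-resume flow above can be pictured with a toy model in which completed steps are remembered and a retry restarts at the failed step. Everything in this sketch is hypothetical (`ResumableWorkflowSketch`, `runOrRetry`); it does not show the Infinitic client call used to retry tasks, only the behavior the list describes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy model of a workflow instance that keeps the results of completed steps.
final class ResumableWorkflowSketch {
    private final List<Supplier<String>> steps;
    private final List<String> results = new ArrayList<>();

    ResumableWorkflowSketch(List<Supplier<String>> steps) {
        this.steps = steps;
    }

    /** Runs the remaining steps; stops at the first failure. Returns true once all steps are done. */
    boolean runOrRetry() {
        while (results.size() < steps.size()) {
            try {
                results.add(steps.get(results.size()).get());
            } catch (RuntimeException e) {
                return false; // completed results are kept for the next attempt
            }
        }
        return true;
    }

    List<String> results() {
        return results;
    }

    public static void main(String[] args) {
        boolean[] paymentFixed = {false};
        List<Supplier<String>> steps = List.of(
                () -> "reserved",
                () -> {
                    if (!paymentFixed[0]) {
                        throw new IllegalStateException("payment service down");
                    }
                    return "charged";
                },
                () -> "shipped");
        ResumableWorkflowSketch workflow = new ResumableWorkflowSketch(steps);

        System.out.println(workflow.runOrRetry() + " " + workflow.results()); // stops at the payment step
        paymentFixed[0] = true;                                               // the underlying issue is resolved
        System.out.println(workflow.runOrRetry() + " " + workflow.results()); // resumes without redoing step one
    }
}
```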

Automatic Error Handling with Try/Catch

Infinitic also supports automatic error handling, making it possible to manage task failures directly within the workflow. This is particularly useful in situations where a failure is expected (e.g., a product is temporarily out of stock).

You can use a try…catch block within a workflow so that the process continues even if a task fails. For example, consider a task called taskA that fails due to a business constraint. Surrounding the call with a try…catch lets the workflow react to the failure and carry on:

try {
    TaskResult result = myService.taskA(data);
} catch (TaskFailedException e) {
    // react to the failure
}

This approach enhances the flexibility and resilience of workflows, allowing them to handle failures more gracefully.

Error Handling for Asynchronous Tasks

As previously mentioned, asynchronous tasks do not immediately trigger workflow errors. However, if you want to react to errors in asynchronous tasks, you can manage them by running the try…catch block for the asynchronous task within a child workflow or as part of another method in your main workflow.

This enables you to handle errors asynchronously while maintaining control over the main workflow’s execution (see the Infinitic documentation).
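The pattern of moving the try…catch into its own method and running that method asynchronously can be sketched as follows. This uses the JDK's `CompletableFuture` as a stand-in for an asynchronous dispatch; the names `AsyncTryCatchSketch`, `reserveStock`, `reserveOrBackorder`, and `placeOrder` are all hypothetical, and the backorder fallback is an invented business rule for the example.

```java
import java.util.concurrent.CompletableFuture;

final class AsyncTryCatchSketch {

    // Hypothetical task: fails for products whose ID marks them as out of stock.
    static String reserveStock(String productId) {
        if (productId.startsWith("out-")) {
            throw new IllegalStateException("out of stock: " + productId);
        }
        return "reserved " + productId;
    }

    // The try/catch lives in its own method so that the whole unit can run asynchronously.
    static String reserveOrBackorder(String productId) {
        try {
            return reserveStock(productId);
        } catch (IllegalStateException e) {
            return "backordered " + productId; // react to the expected failure
        }
    }

    static String placeOrder(String productId) {
        CompletableFuture<String> deferred =
                CompletableFuture.supplyAsync(() -> reserveOrBackorder(productId));
        String invoice = "invoice for " + productId; // the main flow continues meanwhile
        return invoice + "; " + deferred.join();     // never throws: the failure was handled
    }

    public static void main(String[] args) {
        System.out.println(placeOrder("book-1"));
        System.out.println(placeOrder("out-42"));
    }
}
```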

Error Propagation To Parent Workflow Or Client

Infinitic also automatically propagates workflow exceptions to the calling workflow, or to the client that waits synchronously for the result.

Moreover, the exception raised includes comprehensive details of the entire chain of errors that led to the current situation. This provides an opportunity to fully understand the root cause and respond appropriately, if necessary.
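To give a feel for what inspecting a chain of errors looks like, the sketch below walks a nested exception using the standard `Throwable.getCause()` method. It uses ordinary JDK exceptions rather than Infinitic's exception types, and the `ErrorChainSketch` class and the messages are made up for the example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class ErrorChainSketch {

    /** Lists every error from the outermost exception down to the root cause. */
    static List<String> describeChain(Throwable error) {
        List<String> chain = new ArrayList<>();
        for (Throwable t = error; t != null; t = t.getCause()) {
            chain.add(t.getClass().getSimpleName() + ": " + t.getMessage());
        }
        return chain;
    }

    /** Returns the innermost cause, i.e. the error that started the chain. */
    static Throwable rootCause(Throwable error) {
        Throwable root = error;
        while (root.getCause() != null) {
            root = root.getCause();
        }
        return root;
    }

    public static void main(String[] args) {
        // Simulated chain: a task failure fails a child workflow, which fails its parent.
        Throwable error = new RuntimeException("order workflow failed",
                new IllegalStateException("payment workflow failed",
                        new IOException("payment API unreachable")));
        describeChain(error).forEach(System.out::println);
        System.out.println("root cause: " + rootCause(error).getMessage());
    }
}
```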

Conclusion

Infinitic’s robust error management capabilities simplify the process of handling failures in distributed systems. By automatically retrying transient failures, tracking errors, and enabling flexible workflow configurations, Infinitic ensures that complex workflows remain reliable, resilient, and versatile.

Whether you choose to manage errors manually or leverage Infinitic’s built-in automatic error handling, the platform offers a highly flexible approach to error management, making it easier to build and maintain fault-tolerant systems.

We encourage you to give Infinitic a try. Whether you’re building a new feature or optimizing an existing system, Infinitic offers the tools you need without the complexity. For regular updates, best practices, and in-depth tutorials, consider subscribing to our Substack!

Written by Gilles Barbier

Making distributed systems and workflows easy at https://infinitic.io. Proud dad
