Building An Event-Driven Orchestration Engine

Problem: booking a flight + hotel

Let’s consider a typical “flight+hotel booking” process with the following steps:

  • Process payment
  • Book flight
  • Book hotel
  • Notify customer

Centralized Orchestration

In a centralized orchestration, a dedicated engine calls each service directly and in sequence (process payment, then book flight, then book hotel, then notify the customer) and itself keeps track of the overall progress.

Event-Driven Choreography

The current practice is to implement an event-driven architecture. Each service connects to a message bus (hello Kafka!), subscribes to some messages, and publishes others. For example, the payment service subscribes to the “CheckoutTravelCart” command message and produces a new “PaymentProcessed” event. The latter is caught by the flight booking service, which triggers a “FlightBooked” event, itself caught by the hotel booking service to trigger a “HotelBooked” event, and so on.
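
To make this concrete, here is a minimal sketch of the flight booking service in such a choreography, written against the plain Java Kafka client. The topic names mirror the example above; the payload format and the bookFlight call are illustrative assumptions, not actual code from this architecture.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FlightBookingService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "flight-booking-service");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The flight service reacts to the payment service's event...
            consumer.subscribe(List.of("PaymentProcessed"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    String bookingId = record.key();
                    bookFlight(bookingId);
                    // ...and publishes its own event for the next service downstream.
                    producer.send(new ProducerRecord<>("FlightBooked", bookingId, record.value()));
                }
            }
        }
    }

    static void bookFlight(String bookingId) { /* airline API call, not shown */ }
}
```

Note the implicit coupling: the flight service must know that “PaymentProcessed” is its trigger, and the hotel service must in turn know about “FlightBooked”.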

Event-Driven Orchestration

Actually, you can extend the event-driven architecture by adding an orchestrator service. This service is in charge of triggering commands based on the history of past events. The orchestrator maintains a durable state for each business process, which makes it easy both to implement sophisticated processes and to have a source of truth for each process state. Moreover, each service is even more decoupled, as it no longer needs to be aware of other services' events: each one behaves as a simple asynchronous service that receives its own commands and publishes its own events.
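
To illustrate the idea, here is a sketch in Java of such a decision: the orchestrator replays the durable event history of one process against the workflow definition and derives the next command to emit. The types and names are hypothetical; they simply mirror the booking example above.

```java
import java.util.List;

// Hypothetical types: an event that happened, and a command to trigger next.
record Event(String name) {}
record Command(String name) {}

public class BookingOrchestrator {
    // Workflow definition: each command is acknowledged by one event.
    private static final List<String> COMMANDS =
        List.of("ProcessPayment", "BookFlight", "BookHotel", "NotifyCustomer");
    private static final List<String> EVENTS =
        List.of("PaymentProcessed", "FlightBooked", "HotelBooked", "CustomerNotified");

    // A "decision": given the full history of past events for one process,
    // return the next command to trigger, or null when the workflow is done.
    public Command decide(List<Event> history) {
        int completed = 0;
        while (completed < EVENTS.size() && contains(history, EVENTS.get(completed))) {
            completed++;
        }
        // All expected events were seen: the workflow is finished.
        if (completed >= COMMANDS.size()) return null;
        // Otherwise, emit the next command. A real engine would also track
        // in-flight commands to avoid re-emitting one it is still waiting on.
        return new Command(COMMANDS.get(completed));
    }

    private static boolean contains(List<Event> history, String name) {
        return history.stream().anyMatch(e -> e.name().equals(name));
    }
}
```

With this definition, an empty history yields “ProcessPayment”, a history containing “PaymentProcessed” yields “BookFlight”, and so on; the event history itself is the durable source of truth.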

Concretely, such a platform requires:

  • A job manager responsible for processing commands. A client mediates between the job manager and each service and ensures that results are sent back to the event bus. In case of failure, the job manager should handle retries gracefully (a retry sketch follows this list). Ideally, it should also provide a monitoring tool showing the state of each job, from which an administrator can cancel or retry a job: a single failed job can block an entire complex workflow, so it's essential to provide the tooling to fix any issue that may occur.
  • An orchestration system able to receive an event and decide which command to trigger next, based on the workflow definitions and the current workflow state. This work is called a “decision” and can be delegated to a dedicated decision service. To ensure consistency, the orchestrator must guarantee that no two decisions for the same process are ever processed in parallel. A timer service should also be added to handle time-sensitive actions (e.g. “wait until next Monday 9am”).
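
For the retry part, here is a minimal, bus-agnostic sketch in Java of how a job manager might handle a failed command: re-enqueue it with an attempt counter and exponential backoff, then park it in a dead-letter queue where an administrator can intervene. The `Bus` interface and the message shape are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction over the command queue (RabbitMQ, Pulsar, ...).
interface Bus {
    void publish(String queue, byte[] payload, Map<String, String> headers, long delayMillis);
}

public class JobRetrier {
    private static final int MAX_ATTEMPTS = 5;
    private static final long BASE_BACKOFF_MS = 1_000;

    private final Bus bus;

    public JobRetrier(Bus bus) {
        this.bus = bus;
    }

    // Called by the job manager whenever a command execution fails.
    public void onFailure(String queue, byte[] payload, Map<String, String> headers) {
        int attempt = Integer.parseInt(headers.getOrDefault("attempt", "1"));
        Map<String, String> next = new HashMap<>(headers);
        if (attempt >= MAX_ATTEMPTS) {
            // Park the job where the monitoring tool can surface it, so an
            // administrator can cancel it or retry it manually.
            bus.publish(queue + ".dead-letter", payload, next, 0);
            return;
        }
        // Exponential backoff: 1s, 2s, 4s, 8s, ...
        next.put("attempt", String.valueOf(attempt + 1));
        bus.publish(queue, payload, next, BASE_BACKOFF_MS << (attempt - 1));
    }
}
```
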
At Zenaton, we initially built this platform with the following stack:

  • The event bus and command queues were a managed RabbitMQ cluster.
  • The databases were large managed PostgreSQL instances (holding hundreds of millions of rows).
  • The orchestrator and the timer were in-house services developed in Elixir.
  • The decision service was written in Node.js (as workflows were defined in Node.js).

This setup worked, but it showed its limits:

  • The orchestrator was an in-memory stateful service, making it hard to scale horizontally.
  • The PostgreSQL databases became painfully slow when calculating metrics.
  • RabbitMQ does not scale horizontally (although we did not reach its limit).
  • Overall, the platform was already too complex.

Using Apache Pulsar

Pulsar is sometimes presented as the “next-generation Kafka”. It's already used in production at very large scale at Yahoo and Tencent. It has an impressive set of features:

Unified Messaging System

An ideal event-driven architecture requires both a streaming platform for managing events and a queuing system for job processing. Of course, it's possible to have only one or the other (at Zenaton, we had a single RabbitMQ cluster), but that makes your life more complicated. Pulsar supports both models on the same infrastructure through its subscription types.
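
A short sketch with the Pulsar Java client: a Failover subscription gives ordered, stream-like consumption with one active consumer, while a Shared subscription load-balances messages across many workers like a classic work queue. Topic and subscription names are made up for the example.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class UnifiedMessaging {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Streaming view: ordered consumption, one active consumer at a time.
        Consumer<byte[]> streamConsumer = client.newConsumer()
                .topic("booking-events")
                .subscriptionName("orchestrator")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();

        // Queuing view: the same broker load-balances messages
        // across as many workers as subscribe.
        Consumer<byte[]> queueConsumer = client.newConsumer()
                .topic("booking-commands")
                .subscriptionName("job-workers")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        // ... receive/acknowledge loops omitted.
        client.close();
    }
}
```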

Horizontal Scaling of Storage

By leveraging BookKeeper, Pulsar gives you horizontal scaling of your storage out of the box. BookKeeper is used to store your messages permanently; of course, you can define a retention limit. You can also set up tiered storage to offload long-term storage to S3 or other providers.
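
For instance, retention and offloading are configured per namespace through the admin API. A minimal sketch with the Java admin client, assuming a local cluster, the default `public/default` namespace, and (for the offload call) an offload driver such as S3 already configured on the brokers; the values are arbitrary.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class ConfigureRetention {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceUrl("http://localhost:8080")
                .build();

        // Keep acknowledged messages for up to 7 days or 10 GB per topic,
        // whichever limit is reached first.
        int retentionTimeInMinutes = 7 * 24 * 60;
        int retentionSizeInMB = 10 * 1024;
        admin.namespaces().setRetention("public/default",
                new RetentionPolicies(retentionTimeInMinutes, retentionSizeInMB));

        // Offload segments beyond 1 GB per topic to tiered storage.
        admin.namespaces().setOffloadThreshold("public/default", 1024L * 1024 * 1024);

        admin.close();
    }
}
```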

Functions And Stateful Computations

As described in the Pulsar documentation, functions are lightweight compute processes that:

  • consume messages from one or more Pulsar topics,
  • apply user-supplied processing logic to each message,
  • publish the results of the computation to another topic (sketched below).
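
In the orchestration engine described above, the decision step maps naturally onto such a function. Here is a sketch using the Pulsar Functions Java API; the event-to-command routing is a simplified stand-in that ties back to the earlier decision example, not a complete engine.

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Deployed on one or more input topics; Pulsar runs process() per message
// and publishes non-null return values to the configured output topic.
public class DecisionFunction implements Function<String, String> {
    @Override
    public String process(String event, Context context) {
        context.getLogger().info("Deciding after event: {}", event);
        // Derive the next command from the event (a real engine would also
        // consult the durable state of the corresponding process).
        return decide(event);
    }

    private String decide(String event) {
        return switch (event) {
            case "PaymentProcessed" -> "BookFlight";
            case "FlightBooked"     -> "BookHotel";
            case "HotelBooked"      -> "NotifyCustomer";
            default                 -> null; // null means "nothing to publish"
        };
    }
}
```

Functions can also persist per-key state through their Context API (backed by BookKeeper), which is what the “stateful computations” in this section's title refer to.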

Delayed Messages

Delayed messages can be used by the job manager to schedule retries of failed jobs, and also to implement delays within workflows (the timer service described above).
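
Under the hood this is just a producer-side option: the broker withholds the message from consumers until the delay has elapsed (note that delayed delivery applies to shared subscriptions). A sketch with the Java client; the topic name and the `attempt` property are illustrative assumptions.

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class DelayedRetry {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        Producer<byte[]> producer = client.newProducer()
                .topic("booking-commands")
                .create();

        // Re-enqueue a failed command: the broker will not deliver it
        // before the delay has elapsed.
        producer.newMessage()
                .value("BookFlight".getBytes())
                .property("attempt", "2")
                .deliverAfter(5, TimeUnit.MINUTES)
                .send();

        producer.close();
        client.close();
    }
}
```

This is exactly what the job manager sketch earlier needed: the backoff delay maps directly to deliverAfter, with no extra scheduler to operate.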

Ability to SQL-Query Data From Pulsar

As Pulsar provides a way to query the data stored in BookKeeper (Pulsar SQL, built on Presto), it's quite easy to set up an API that a dashboard can use to display what happens in workflows. As a cherry on top, this API will also cover the data offloaded to tiered storage such as S3.
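
For example, assuming Pulsar SQL is running on its default Presto worker port (8081), a Presto JDBC driver is on the classpath, and a hypothetical `workflow-events` topic, a dashboard API could fetch the latest events like this:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WorkflowDashboardQuery {
    public static void main(String[] args) throws Exception {
        // Pulsar SQL exposes a Presto endpoint; topics appear as tables
        // under the tenant/namespace schema of the "pulsar" catalog.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:presto://localhost:8081/pulsar", "dashboard", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "select * from pulsar.\"public/default\".\"workflow-events\" "
                     + "order by __publish_time__ desc limit 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("__publish_time__"));
            }
        }
    }
}
```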

Conclusions

Based on my experience at Zenaton, Apache Pulsar appears to be an ideal foundation on which to build a lean but powerful, enterprise-grade, event-driven workflow orchestration engine.
