ALF Reliability Approach
ALF is a framework for coordinating distributed tools. It is the nature of distributed systems that, at some point, one part or another of the system will be unavailable for some reason. This paper discusses how an ALF-based system might address this fact.
The reliability of an ALF based system primarily revolves around the ability to deliver a message between the various participants in the system.
Technologies to Address Delivery Failure
There are three possible technologies and techniques to address delivery failure:
a) High Availability
b) WS-Reliable Messaging
c) Message queues (and Service Bus)
a) High Availability
Using multiple symmetric, load-balanced servers can make a service highly available. This does not guarantee delivery but, assuming proper management, it makes it unlikely that the service will be unavailable. It is relatively easy and inexpensive to configure, offers many choices and is (mostly) transparent to ALF and the participating tools. It also addresses scalability.
The exemplar infrastructure used for ALF "out of the box" is based on open source software such as Tomcat. For example, the EventManager has been developed and tested using Tomcat. Tomcat's scalability has improved over successive releases, and it can be configured for higher availability in two ways:
1. Using it as a Servlet Container for either Microsoft IIS or Apache
2. Configuring Tomcat 5 for clustering and load balancing
ALF is architected to use standards wherever possible. By conforming to the specification for Java Servlets, ALF components can (in theory) run in any Servlet Container. For ALF installations where greater scalability and reliability are needed, commercial application servers such as IBM's WebSphere or BEA's WebLogic typically offer several mechanisms for load balancing and failover and numerous options for tuning.
b) WS-Reliable Messaging
This protocol specification can guarantee delivery, but it requires both ends to use the protocol. From a practical point of view the specification is still some way off, but it offers the most seamless "guaranteed delivery" for the future. I suspect that to really "guarantee" delivery it must be combined with a store-and-forward mechanism.
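The idea of numbered messages, acknowledgements and retransmission can be illustrated with a small sketch. This is a toy model written in Python for brevity, not the WS-Reliable Messaging wire protocol: the names Sender, Receiver and lossy_transport are hypothetical, and the "store" is an in-memory dictionary rather than persistent storage. It assumes the receiver is allowed to drop duplicates by message number.

```python
class Receiver:
    """Accepts each message number once and acknowledges every arrival."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, number, payload):
        if number not in self.seen:       # ignore duplicates caused by retransmission
            self.seen.add(number)
            self.delivered.append((number, payload))
        return number                      # acknowledgement for this message number


class Sender:
    """Keeps every message until it is acknowledged (the store in store and forward)."""

    def __init__(self, transport):
        self.transport = transport         # callable(number, payload) -> ack number or None
        self.unacked = {}
        self.next_number = 1

    def send(self, payload):
        number = self.next_number
        self.next_number += 1
        self.unacked[number] = payload     # store before attempting delivery
        self._attempt(number)
        return number

    def _attempt(self, number):
        ack = self.transport(number, self.unacked[number])
        if ack == number:
            del self.unacked[number]       # only an acknowledgement releases the message

    def retransmit(self):
        """Resend everything still unacknowledged; True when nothing is left."""
        for number in sorted(self.unacked):
            self._attempt(number)
        return not self.unacked


def lossy_transport(receiver, drop_requests=(), drop_acks=()):
    """Hypothetical channel that loses the calls whose 1-based index is listed."""
    calls = 0

    def transport(number, payload):
        nonlocal calls
        calls += 1
        if calls in drop_requests:
            return None                    # the message never reached the receiver
        ack = receiver.receive(number, payload)
        if calls in drop_acks:
            return None                    # the message arrived but the ack was lost
        return ack

    return transport
```

Note that this sketch delivers each message exactly once but does not preserve ordering; a message whose first attempt was lost is delivered after later messages that got through.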
c) Message Queues
Message queues can guarantee delivery but require both ends to integrate with some form of message queue client (e.g. via JMS). The technology is well proven and widely available, but it requires fairly extensive integration. Message queuing offers the best immediately practical solution to guaranteed delivery, but it is intrusive to the tool, possibly making it a non-starter in some cases. Also, the queue configuration needed to realize the benefit could become complex.
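To make the "persist first, remove only after successful delivery" behaviour of a message queue concrete, here is a minimal sketch. It is written in Python with the standard sqlite3 module purely for illustration; it is not JMS and not part of ALF. The DurableQueue class, the drain function and the table layout are all assumptions made for this example.

```python
import sqlite3


class DurableQueue:
    """Hypothetical queue that persists messages until they are acknowledged."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to survive a process restart.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "body TEXT NOT NULL, "
            "acked INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def put(self, body):
        cursor = self.db.execute("INSERT INTO messages (body) VALUES (?)", (body,))
        self.db.commit()                   # the message is stored before put() returns
        return cursor.lastrowid

    def peek(self):
        """Return the oldest unacknowledged (id, body) pair, or None."""
        return self.db.execute(
            "SELECT id, body FROM messages WHERE acked = 0 ORDER BY id LIMIT 1"
        ).fetchone()

    def ack(self, message_id):
        self.db.execute("UPDATE messages SET acked = 1 WHERE id = ?", (message_id,))
        self.db.commit()


def drain(queue, handler):
    """Deliver queued messages in order; stop at the first delivery failure."""
    delivered = 0
    while True:
        row = queue.peek()
        if row is None:
            return delivered
        message_id, body = row
        try:
            handler(body)
        except ConnectionError:
            return delivered               # the message stays queued for a later attempt
        queue.ack(message_id)              # acknowledge only after the handler succeeded
        delivered += 1
```

Because a message is acknowledged only after the handler returns, a crash between the two steps can cause a repeated delivery, so the receiving side should tolerate duplicates.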
Note on Service Bus platforms: In theory a Service Bus package could be used to provide the Message Queues and other infrastructure required by ALF. The main problem with a Service Bus is that it is more an idea that has multiple expressions than a standard platform. Consequently it seems that one would have to create an implementation specifically for a particular Service Bus. Some standards exist but they are far from universally adopted. ALF has deliberately avoided choosing a particular service bus. Instead ALF is concentrating on defining a set of interfaces based on adopted standards.
Delivery Failure in ALF
There are three communication paths in ALF where a delivery failure may occur:
1. from Tools to Event Manager
2. from Event Manager to ServiceFlow (BPEL) engine
3. from Service Flow (BPEL) engine to Tools
From Tools to Event Manager
For "Tool to Event Manager" the suggestion is to make it "highly available" by using parallel symmetric servers in a load balanced farm.
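The effect of a load-balanced farm can be sketched as round-robin selection with failover to the next server. In a real deployment this logic lives in the load balancer or application server, not in ALF or the tools; the ServerFarm class below is a hypothetical Python illustration in which each server is modelled as a callable that raises ConnectionError when it is down.

```python
import itertools


class ServerFarm:
    """Hypothetical model of symmetric servers behind a round-robin balancer."""

    def __init__(self, servers):
        self.servers = list(servers)       # callables(event) that may raise ConnectionError
        self._start = itertools.cycle(range(len(self.servers)))

    def deliver(self, event):
        first = next(self._start)          # rotate the starting server on every call
        for offset in range(len(self.servers)):
            server = self.servers[(first + offset) % len(self.servers)]
            try:
                return server(event)
            except ConnectionError:
                continue                   # this server is down, try the next one
        raise ConnectionError("no server in the farm accepted the event")
```

Delivery fails only when every server is unavailable at the same time, which is why this approach makes failure unlikely without guaranteeing delivery.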
From Event Manager to ServiceFlow (BPEL) engine
For "Event Manager to ServiceFlow (BPEL) engine", symmetric web farms are a bit more problematic since Service Flows can be stateful. Again, a practical implementation of WS-Reliable Messaging could solve the problem. In the shorter term, however, this seems like a good place to use a Message Queue, since we have control of the sender, and the possible receivers (BPEL engines) may either already be addressing the issue or may be reasonably easy to wrap. This is one place where the Event Manager may provide some SPI or wrapper to enable some kind of persistent queuing in the 1.0 timeframe.
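One possible shape for such an SPI is sketched below. This is an assumption made for illustration, not the ALF Event Manager interface: the names EventDispatcher, DirectDispatcher and QueuingDispatcher are hypothetical, the sketch is in Python for brevity, and the queue is held in memory where a real wrapper would persist it.

```python
from abc import ABC, abstractmethod
from collections import deque


class EventDispatcher(ABC):
    """Hypothetical SPI: how the Event Manager hands an event to an engine."""

    @abstractmethod
    def dispatch(self, event):
        ...


class DirectDispatcher(EventDispatcher):
    """Calls the engine directly; a failure propagates to the caller."""

    def __init__(self, engine):
        self.engine = engine               # callable(event), raises ConnectionError when down

    def dispatch(self, event):
        self.engine(event)


class QueuingDispatcher(EventDispatcher):
    """Wraps another dispatcher and holds events until they are accepted."""

    def __init__(self, inner):
        self.inner = inner
        self.pending = deque()             # in memory here; persistent in a real wrapper

    def dispatch(self, event):
        self.pending.append(event)
        self.flush()

    def flush(self):
        """Forward pending events in order; True when the queue is empty."""
        while self.pending:
            try:
                self.inner.dispatch(self.pending[0])
            except ConnectionError:
                return False               # engine unavailable, keep the event queued
            self.pending.popleft()
        return True
```

With this arrangement the Event Manager code depends only on the dispatch call, so queuing can be added or replaced per BPEL engine without changing the sender.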
From Service Flow (BPEL) engine to Tools
For "Service Flow (BPEL) engine to Tools" there is no single solution, since it depends very much on the tools. Any particular tool may provide one of the standard solutions, but even if they all do, that will not necessarily get us what we want. We really have to look at the particular application and what we are trying to do in order to address the problem appropriately. The usual problem is that we wish to synchronize two or more tools in some way. Ideally we would do that in a distributed ACID transaction but, generally, that is not available. In lieu of that, we can use the Service Flow to update all parties and use "compensation" to handle failure, such that we are either able to return all parties to their original state or able to indicate in some meaningful and obvious way that the synchronization did not complete.
To achieve this we have to regard the Service Flow as having complete control of all the state changes, including that of the tool that raised the event. The Service Flow must be able to "undo" the state change that caused the event that ran the Service Flow in the first place. Note that "undo" here may not be a literal "undo" but could be just the recording of an indication that further action is required. What can be done in any particular case may depend on how the tools work and the particular application being implemented.
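The compensation pattern described here can be sketched in a few lines. In practice it would be expressed in the BPEL Service Flow itself; the Python function below is only an illustration, and run_with_compensation, the step tuples and the result dictionary are hypothetical names chosen for this example.

```python
def run_with_compensation(steps):
    """Run (name, action, compensate) steps; undo completed steps on failure.

    Returns a dict whose "status" is "completed", "compensated" (every party
    was returned to its original state) or "needs_attention" (at least one
    compensation failed, so the incomplete synchronization must be flagged).
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            needs_attention = []
            # Undo in reverse order so later changes are reverted first.
            for done_name, done_compensate in reversed(completed):
                try:
                    done_compensate()
                except Exception:
                    needs_attention.append(done_name)
            status = "needs_attention" if needs_attention else "compensated"
            return {"status": status, "failed_step": name,
                    "needs_attention": needs_attention}
        completed.append((name, compensate))
    return {"status": "completed", "failed_step": None, "needs_attention": []}
```

A compensate callable does not have to be a literal undo. It may simply record that further action is required, which corresponds to the "needs_attention" outcome being surfaced to a person.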
Centralizing "transactions" so that they are logically contained within a Service Flow is really the critical issue. With the exception of the Event Manager to BPEL engine communication, the question of guaranteed delivery may revolve around what, if anything, a tool might do if its attempt to raise an event fails. If a tool is entirely passive then guaranteed delivery is important. If the tool can detect and act upon a communications failure then guaranteed delivery may be less important, and possibly not important at all, depending on the persistence model of the tool. It may be very appropriate for a tool that controls the human workflow to have the option of acting upon an event failure in its workflow logic. For other tools, such as SCM, the passive approach may be all that is practical or perhaps desirable.
ESBs as an alternate approach to scalability
A very promising development for providing scalability and reliability is the ESB, available in both commercial and open source forms. A potential future direction for ALF is to run on top of an ESB. The ESB may provide mechanisms for accessing tool resources that are not exposed as web services, may subsume some of the routing and filtering duties that the ALF EventManager provides today, and may provide some of the mediation services (e.g. data mapping) that are encoded in BPEL Service Flows today.