SMILA/Project Concepts/Blackboard Service Restructured

Restructuring the Blackboard Service

Why all this?

While thinking about the fix for 269967 it occurred to me that the current blackboard service architecture is too simple. The problem there is that a workflow creates multiple records on the blackboard by splitting the input object. After processing, all of these records must be removed from the blackboard again (after committing them to the storages if processing was successful) to release the memory they use. This is easy if the pipeline contains a single split, because then the listener knows all record IDs: the one input ID and the output IDs. But imagine that the pipeline splits the results of the first split again: the IDs produced by the first split would not be returned to the listener, so it could not take care of committing or invalidating them. They would stay on the blackboard, which results in a memory leak.

There are several ways to solve this:

  1. Let the pipelets that split a record decide what to do with the source records: Apart from introducing a lot of error potential through badly implemented pipelets (and making pipelet programming harder in general), this is problematic because a single pipelet usually cannot decide whether other pipelets might still need the source records.
  2. Manage dependencies between records on the blackboard so that all element/fragment records are removed as well when the source record is removed: This seems error prone, too. How should error cases be handled? What if someone else already wants to process a split result when the source record is finished? Also, a record may be created in other ways than by splitting, so that no dependency is recorded.

Additionally, there may be error cases that prevent a record from being correctly removed from the blackboard, so there is always a potential for memory leaks.

Next, I think the current design introduces problems with simultaneous access to a single record from two separate pipelines.

Finally, the Search service currently needs a blackboard implementation that does not persist the records. It uses its own implementation of the Blackboard interface that is not linked to any storage and just keeps everything in memory. This works, but it would probably be nicer to have all blackboard code in a single place. Other services might have use for such a "transient" blackboard implementation, too.

Proposal

The following proposal might solve these problems (or at least be a starting point):

  • Instead of a single Blackboard service we create a BlackboardFactory service. The factory is optionally linked to binary and record storages and runs as a Declarative Service.
  • The factory can create Blackboard instances which are either "transient" (pure in-memory implementation, not using any storages) or "persisting" (linked to binary storage and optionally to record storage). The client selects which kind of blackboard it wants to use. A persisting blackboard can only be created successfully if at least a binary storage is known. Creation of transient blackboards is always possible.
  • For each "session" a new blackboard instance is created that manages only the records worked on in this session. A session is for example:
    • a single task list execution of a QueueWorker router or listener (i.e. adding/deleting one record in Connectivity, or processing one input record from a queue message and managing all additional records created by the invoked workflows)
    • a single search request in the search service.
  • After the session the blackboard instance is released completely, thus freeing any memory resources automatically without interfering with other blackboard sessions.
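The session lifecycle described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the SMILA implementation: the Blackboard interface is reduced to a few hypothetical methods (setRecord, size), record IDs are plain strings, and TransientBlackboard and SessionSketch are names invented for this example. The point is that a single commit() or invalidate() at the end of the session releases all records, including those created by a nested split whose IDs the caller never sees.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for the proposed Blackboard interface (assumption: the real one has many more methods). */
interface Blackboard {
  void setRecord(String id, String content); // hypothetical accessor for this sketch
  int size();                                // hypothetical, only used to observe the effect
  void commit();
  void invalidate();
}

/** Hypothetical transient blackboard: keeps the records of one session in memory only. */
class TransientBlackboard implements Blackboard {
  private final Map<String, String> records = new HashMap<>();

  @Override public void setRecord(String id, String content) { records.put(id, content); }
  @Override public int size() { return records.size(); }
  // A transient blackboard has no storages, so commit only releases the memory.
  @Override public void commit() { records.clear(); }
  @Override public void invalidate() { records.clear(); }
}

class SessionSketch {
  /** Splits a record into two parts, as a splitting pipelet might do. */
  static String[] split(Blackboard blackboard, String sourceId) {
    String[] partIds = { sourceId + "#1", sourceId + "#2" };
    for (String partId : partIds) {
      blackboard.setRecord(partId, "part of " + sourceId);
    }
    return partIds;
  }

  /** Runs one session and returns the number of records left on the blackboard afterwards. */
  static int runSession(Blackboard blackboard, String inputId, boolean simulateError) {
    boolean success = false;
    try {
      blackboard.setRecord(inputId, "input");
      for (String partId : split(blackboard, inputId)) {
        split(blackboard, partId); // nested split: these IDs are never reported to the caller
      }
      if (simulateError) {
        throw new IllegalStateException("simulated processing error");
      }
      success = true;
    } catch (IllegalStateException e) {
      // processing failed; the records are discarded in the finally block
    } finally {
      if (success) {
        blackboard.commit();
      } else {
        blackboard.invalidate();
      }
    }
    return blackboard.size();
  }

  public static void main(String[] args) {
    System.out.println(runSession(new TransientBlackboard(), "doc", false));
    System.out.println(runSession(new TransientBlackboard(), "doc", true));
  }
}
```

Because each session owns its blackboard instance, the caller does not need to track any record IDs to clean up, in the success case or in the error case.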

New interfaces

/**
 * Extension of existing BlackboardService interface,
 * but not a service (in the OSGi sense) anymore.
 */
interface Blackboard {
  //
  // this interface contains all methods of current BlackboardService interface, plus:
  //
 
  /**
   * commit ALL records on this blackboard to storages (if any) and release resources
   */
  void commit(); 
 
  /**
   * remove ALL records from blackboard and release all associated resources
   */
  void invalidate();
}

interface BlackboardFactory {
  /**
   * create a new non-persisting blackboard instance. 
   * This method must always return a valid empty blackboard instance.
   */
  Blackboard createTransientBlackboard();
 
  /**
   * create a blackboard able to persist records in storages
   * @throws BlackboardAccessException if no persisting blackboard can be created because
   * not even a binary storage service is available (record storage remains optional)
   */
  Blackboard createPersistingBlackboard() throws BlackboardAccessException;
}
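
How a factory could honor these contracts might look roughly like the following sketch. It is not taken from the branch: BinaryStorage and RecordStorage are hypothetical marker types standing in for the real storage services, SketchBlackboard and SketchBlackboardFactory are invented names, the setter methods only imitate what Declarative Services bind methods would do, and Blackboard is reduced to its two new methods.

```java
/** Stand-in for the Blackboard interface, reduced to the two new methods. */
interface Blackboard {
  void commit();
  void invalidate();
}

interface BlackboardFactory {
  Blackboard createTransientBlackboard();
  Blackboard createPersistingBlackboard() throws BlackboardAccessException;
}

/** Stand-in for the checked exception named in the proposal. */
class BlackboardAccessException extends Exception {
  BlackboardAccessException(String message) { super(message); }
}

/** Hypothetical marker types for the two storage services. */
interface BinaryStorage { }
interface RecordStorage { }

/** Hypothetical blackboard that only remembers which storages it was given. */
class SketchBlackboard implements Blackboard {
  private final BinaryStorage binaryStorage; // null for a transient blackboard
  private final RecordStorage recordStorage; // may be null even for a persisting blackboard

  SketchBlackboard(BinaryStorage binaryStorage, RecordStorage recordStorage) {
    this.binaryStorage = binaryStorage;
    this.recordStorage = recordStorage;
  }

  boolean isPersisting() { return binaryStorage != null; }
  boolean hasRecordStorage() { return recordStorage != null; }

  @Override public void commit() { /* would write all records to the storages, if any */ }
  @Override public void invalidate() { /* would drop all records without writing them */ }
}

class SketchBlackboardFactory implements BlackboardFactory {
  // In a Declarative Services component these references would be set by bind/unbind methods.
  private volatile BinaryStorage binaryStorage;
  private volatile RecordStorage recordStorage;

  void setBinaryStorage(BinaryStorage storage) { binaryStorage = storage; }
  void setRecordStorage(RecordStorage storage) { recordStorage = storage; }

  @Override
  public Blackboard createTransientBlackboard() {
    return new SketchBlackboard(null, null); // always possible, needs no storages
  }

  @Override
  public Blackboard createPersistingBlackboard() throws BlackboardAccessException {
    BinaryStorage binary = binaryStorage; // read once, the service may be unbound concurrently
    if (binary == null) {
      throw new BlackboardAccessException("no binary storage service available");
    }
    return new SketchBlackboard(binary, recordStorage); // record storage stays optional
  }
}
```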

Impact on existing code

Most code could be left unchanged, because the new Blackboard interface contains all current methods, too. The exception is the renaming of the current interface "BlackboardService" to simply "Blackboard"; for compatibility we could add a deprecated interface "BlackboardService" that extends the new one. Only the access to the blackboard in the QueueWorker and SearchService would have to be changed. And of course we would have to find the places where the final commit() or invalidate() should be called. No changes are necessary in pipelets or processing services.
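
The compatibility idea could be sketched like this. Blackboard is again reduced to its two new methods, and CompatibilitySketch with its finishSession method is a hypothetical helper invented for this example; only the interface names come from the proposal.

```java
/** Stand-in for the new Blackboard interface, reduced to the two new methods. */
interface Blackboard {
  void commit();
  void invalidate();
}

/**
 * Old name, kept only so that existing code still compiles.
 * @deprecated use {@link Blackboard} instead.
 */
@Deprecated
interface BlackboardService extends Blackboard {
}

class CompatibilitySketch {
  /** New code is written against Blackboard and also accepts objects typed with the old name. */
  static void finishSession(Blackboard blackboard, boolean success) {
    if (success) {
      blackboard.commit();
    } else {
      blackboard.invalidate();
    }
  }
}
```

Code that still declares variables or parameters of type BlackboardService would only get a deprecation warning and could be migrated step by step.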

Further usage

  • The QueueWorker implementation is currently intended to also support operation without a blackboard by working directly with records. This could be changed to use a transient blackboard instead. I think this would make the QueueWorker code (especially the TaskListExecution service) a lot simpler: there are currently many conditions that decide whether a record must be synced to a blackboard, etc. For this, some code in the QueueWorker would have to be refactored to use "Id" instead of "Record" in method signatures.
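
A rough sketch of what ID-based task signatures could look like, under heavy simplification: SimpleRecord, Task, TransientBlackboard and TaskListExecutionSketch are hypothetical stand-ins invented for this example, IDs are plain strings instead of SMILA "Id" objects, and the blackboard offers only the three methods needed here. The incoming record is put on the blackboard once, and every task then receives only the ID.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for a record: an ID plus a single text value. */
class SimpleRecord {
  final String id;
  String text;

  SimpleRecord(String id, String text) {
    this.id = id;
    this.text = text;
  }
}

/** Stand-in for the Blackboard interface with only the methods needed here. */
interface Blackboard {
  void setRecord(SimpleRecord record);
  SimpleRecord getRecord(String id);
  void invalidate();
}

/** Hypothetical in-memory blackboard for one task list execution. */
class TransientBlackboard implements Blackboard {
  private final Map<String, SimpleRecord> records = new HashMap<>();

  @Override public void setRecord(SimpleRecord record) { records.put(record.id, record); }
  @Override public SimpleRecord getRecord(String id) { return records.get(id); }
  @Override public void invalidate() { records.clear(); }
}

/** Hypothetical task signature: a task gets the blackboard and an ID, not the record itself. */
interface Task {
  void execute(Blackboard blackboard, String id);
}

class TaskListExecutionSketch {
  /** Puts the incoming record on the blackboard, then runs all tasks with its ID only. */
  static void execute(SimpleRecord incoming, List<Task> tasks, Blackboard blackboard) {
    blackboard.setRecord(incoming);
    try {
      for (Task task : tasks) {
        task.execute(blackboard, incoming.id);
      }
    } finally {
      // Transient case: nothing to persist. A persisting blackboard would be committed on success.
      blackboard.invalidate();
    }
  }
}
```

A task that needs the complete record can fetch it with getRecord(id), so no separate code path for operation without a blackboard is required.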

Later extensions

  • createPersistingBlackboard() could be extended to support persisting into different Storage Points by adding a method parameter that names the ID of the storage point to use.
  • Maybe we can use the BlackboardFactory to add caching of records/attachments over multiple blackboard sessions.

Implementation

The proposal is implemented for review in a branch of our repository:

http://dev.eclipse.org/svnroot/rt/org.eclipse.smila/branches/2009-04-06_jschumacher_Blackboard-Restructuring

Discussion

  • User:Daniel.stucky.empolis.com: I like the idea of restructuring the BlackboardService. I encountered some of those issues during development of the chemical information extraction demo, where Records are split into elements and fragments. Is it possible to hide the commit() and invalidate() methods so that these cannot be used from within Pipelets?
    • Juergen.schumacher.empolis.com Everything is possible (-; We could introduce another interface that extends Blackboard and adds these methods, but I do not really like this approach, it makes the API too complicated. I think it should be sufficient to just add to the javadoc of the methods, that these methods are not intended to be called by pipelets.
  • User:Andreas.Weber.empolis.com: Hi Jürgen, that sounds very good to me, too. Like Daniel, I also had some open questions in SMILA usage when splitting records in a Pipelet (e.g.: Should I remove the source record? Do I have to care about blackboard synchronization? etc.). So I like the idea of having a "framework" that supports a user with that, or even better, releases a user from having to think about it at all. ;) BTW, I think it's very important to clarify the lifetime of a blackboard instance (resp. "session"). In your three examples, what about simplifying the first and second example by just saying: "In a QueueWorker (router and listener), the session's/blackboard's lifetime is a single task execution"?
    • Juergen.schumacher.empolis.com Done. But I've changed it to "task list execution", because in our code a "Task" is either a "Process" or a "Send", so "single task execution" could sound like creating a new blackboard between a "Process" and "Send" in the same listener task list.
  • Churkin.ivan.gmail.com I want to make only one remark: the QueueWorker has "Record" in its signatures because a Record is stored in the queue (mostly it contains only the Id). Some time ago it was planned to support XPath message selectors. And now the ability to store a whole record in the queue is used to organize the cluster.
    • Juergen.schumacher.empolis.com The QueueWorker can still do the same. In the branch I only changed the interfaces of the actual QueueWorker tasks such that the TaskListExecutionService first puts the incoming record on a Blackboard instance (optionally a transient one, so that it is not written to any storages) and then uses only IDs instead of the complete records to call the tasks. If a task then needs the complete record, it can easily get it from the blackboard instance used in this execution. Or did I miss the point of your comment?