
Talk:SMILA/Component Requirements/Record Binary Storage Requirements

Revision as of 08:29, 22 October 2008 by (Talk | contribs)



Tom: I wouldn't use the SMILA-specific 'record' and 'attachment' name parts because the in/out arguments are generic.

Yes, indeed. There is nothing record-specific in the arguments, so we do not need to talk about records and attachments.

Well, if you take a closer look at the page name, you'll see that the page actually defines the _record_ binary storage requirements ;-) So here we are talking about record-centric binary storage. The reason why I proposed to "tie" this concept to records is quite simple: we have to take care that our project concepts are not only technically sophisticated but also "user friendly" and easy to understand. IMO this can be achieved by:

  1. Having a simple story behind SMILA: SMILA operates on (un)structured data. This data is internally encapsulated in objects that we call records. The records are passed from one service to another and can themselves contain (very) large amounts of information (e.g. one record can represent a whole book). To be able to swiftly share this information between BPEL services, we introduced the blackboard service, which is actually the only client of low-level services like record binary and XML storage.
  2. Having self-explaining method names.
  3. Avoiding the "generic bug". By introducing services that are too general (i.e. can serve multiple different clients) you can confuse the person that is using them and also make them harder to implement than really needed.


  1. Simplicity of use for users is IMHO a non-argument because, as you pointed out, the Blackboard is the one and only user currently. And since the Blackboard is an internal building block, I suppose we should be able to handle that generic interface.
  2. Furthermore, I can quite well see that plug-ins by others may want to use an infrastructure block such as the BinStore, and hence a record-centric naming would be rather misleading in that case.

I'm not satisfied with this either. I think a service does not become specific to something by making the method names say so, but by the semantics of the signature. By using Strings as IDs and byte[]/Stream for data you are already as generic as you can get for a binary storage. But then the method name says "no, this is only for record attachments". That's strange to me. In no time there will be some other service that would like to store a blob that is not a record attachment, and then it will either misuse this service or duplicate it. If you want to have a record-specific binary storage service, you should use record IDs + attachment names in the signature so that the service hides the calculation of "physical" blob IDs from the blackboard service. That would be OK with me. But then, I would still like to separate the "physical storage" layer (the one that cares about storing and retrieving blobs with String IDs) from the "record attachment storage" layer to make the physical layer exchangeable. Having said that, the physical layer does not necessarily have to be created by us; maybe something like the Eclipse File System (EFS, [1]) would be sufficient, too (as a side note, there are ideas to create an EFS provider for Hadoop DFS [2], so we would get a DFS version of binary storage quite easily).

Yes Jürgen, you're right. The semantics of the signature in my initial API proposal were not clear enough. Sorry. My initial suggestion was to implement a record-centric binary storage service, so the API should fully reflect that fact, and therefore the correct signature should look something like this:

void storeRecordAttachment(Id id, String attachmentName, InputStream attachmentStream)
void storeRecordAttachment(Id id, String attachmentName, byte[] attachment)
byte[] fetchRecordAttachmentAsByte(Id id, String attachmentName)
InputStream fetchRecordAttachmentAsStream(Id id, String attachmentName)
void removeRecordAttachment(Id id, String attachmentName)
int fetchRecordAttachmentSize(Id id, String attachmentName)

And yes, I am aware of the fact that in the (near) future some other service would like to store blobs too. When that case occurs, I suggest implementing a dedicated service which can then be optimized for that specific client. I see nothing wrong in having more than one binary storage service. BTW: EFS sounds interesting.
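To make the layering idea from this thread concrete, here is a minimal sketch, not the SMILA implementation. It assumes a hypothetical generic `BlobStore` (the "physical" layer) and a hypothetical `RecordAttachmentStore` on top of it that hides the blob ID calculation. The record ID is a plain `String` standing in for SMILA's `Id` type, and SHA-256 over record ID plus attachment name is only an assumed hashing scheme.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical "physical" layer: stores blobs under opaque String IDs. */
interface BlobStore {
    void store(String id, byte[] blob);
    byte[] fetch(String id); // returns null if the ID is unknown (assumption)
    void remove(String id);
}

/** In-memory stand-in for a file-system, DB or EFS backed implementation. */
class InMemoryBlobStore implements BlobStore {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    public void store(String id, byte[] blob) { blobs.put(id, blob.clone()); }

    public byte[] fetch(String id) {
        byte[] blob = blobs.get(id);
        return blob == null ? null : blob.clone();
    }

    public void remove(String id) { blobs.remove(id); }
}

/** Hypothetical record-centric layer: callers never see physical blob IDs. */
class RecordAttachmentStore {
    private final BlobStore physical;

    RecordAttachmentStore(BlobStore physical) { this.physical = physical; }

    void storeRecordAttachment(String recordId, String attachmentName, byte[] attachment) {
        physical.store(blobId(recordId, attachmentName), attachment);
    }

    byte[] fetchRecordAttachmentAsByte(String recordId, String attachmentName) {
        return physical.fetch(blobId(recordId, attachmentName));
    }

    void removeRecordAttachment(String recordId, String attachmentName) {
        physical.remove(blobId(recordId, attachmentName));
    }

    /** Assumed scheme: hex-encoded SHA-256 of recordId, a 0 byte, and the attachment name. */
    static String blobId(String recordId, String attachmentName) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(recordId.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator so ("ab","c") and ("a","bc") differ
            md.update(attachmentName.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

With this split, swapping `InMemoryBlobStore` for another `BlobStore` implementation would not touch the record-centric API, which is the exchangeability argued for above.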

Let me summarize our discussion so far. I think that right now we stand at a crossroads: we implement either a generic or a record-centric binary storage service. IMHO it would make more sense to discuss the details after we have made that decision. So how about a vote?

What about using the BinaryStore API as a framework (not a service) in context-specific wrapping services? In this way we could provide services with specialized APIs, for example for record processing, e.g. using Id and Record objects. Some of the code currently in the Blackboard could be moved to this service (e.g. internal ID creation + hashing).

Namespaces

I wonder if it would make sense to add another parameter to ensure that different clients that use BinStorage for different purposes do not have ID collisions. Otherwise one client could overwrite the BLOB of another client by accidentally using the same ID. So we could have:

void store(String namespace, String id, InputStream stream)
void store(String namespace, String id, byte[] blob)
byte[] fetchAsByte(String namespace, String id)
InputStream fetchAsStream(String namespace, String id)
void remove(String namespace, String id)
int fetchSize(String namespace, String id)

IMO there is no need to introduce a namespace since we in fact only have one client: the blackboard service.

Alternatively we could require that IDs are not unstructured strings but something like "URIs" (maybe even javax.xml.namespace.QName instead of java.lang.String?)

I would leave IDs (like in the current implementation) being generated by the blackboard service as strings that are calculated like this: id = hash( + attachmentName)
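The collision concern behind the namespace parameter can be illustrated with a small sketch. `NamespacedBinaryStore` is a hypothetical in-memory class, not part of SMILA. It covers only the byte[] variants of the proposed methods, and returning `null` or `-1` for unknown IDs is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory store where every blob lives inside a namespace. */
class NamespacedBinaryStore {
    // namespace -> (id -> blob)
    private final Map<String, Map<String, byte[]>> blobsByNamespace = new HashMap<>();

    synchronized void store(String namespace, String id, byte[] blob) {
        blobsByNamespace
            .computeIfAbsent(namespace, ns -> new HashMap<>())
            .put(id, blob.clone());
    }

    /** Returns a copy of the blob, or null if it does not exist (assumption). */
    synchronized byte[] fetchAsByte(String namespace, String id) {
        Map<String, byte[]> blobs = blobsByNamespace.get(namespace);
        byte[] blob = blobs == null ? null : blobs.get(id);
        return blob == null ? null : blob.clone();
    }

    synchronized void remove(String namespace, String id) {
        Map<String, byte[]> blobs = blobsByNamespace.get(namespace);
        if (blobs != null) {
            blobs.remove(id);
        }
    }

    /** Returns the blob size in bytes, or -1 if it does not exist (assumption). */
    synchronized int fetchSize(String namespace, String id) {
        Map<String, byte[]> blobs = blobsByNamespace.get(namespace);
        byte[] blob = blobs == null ? null : blobs.get(id);
        return blob == null ? -1 : blob.length;
    }
}
```

Two clients that happen to pick the same ID, say "42", no longer overwrite each other as long as they use different namespaces. The cost is that every call has to carry the extra argument.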


> namespace parameter

I had something different in mind, namely that the client would pass in that info when requesting the service. That way we could have different BinStoreService instances and the client wouldn't have to pass in that info each time.

OK. I suppose creating one BinStorage service per "namespace" would be fine, too. So the API could be:

void store(String id, InputStream stream)
void store(String id, byte[] blob)
byte[] fetchAsByte(String id)
InputStream fetchAsStream(String id)
void remove(String id)
int fetchSize(String id)

Here we have it: the "generic bug". Not only does the client need to know which particular implementation of the binary storage he wants to contact, but he also needs to know which instance has been properly configured for him.

This is intended to be a matter of configuration! Hence the client doesn't need to know it, only the admin who configures it, iff he wants or needs something other than the default config.

From Georg I have received the feature request to have a means to partition both storages. Introducing namespaces was an idea from Jürgen, and I think it is a valid one. My suggestion was to not have a parameter in the methods but to handle that on the instance level. I already had this in mind but never voiced it before, as it wasn't a concern up to now. Rationale: if we handle namespaces at the beginning, i.e. when requesting the service, we can introduce this feature later when we need it, as it would just be another overloaded method that allows passing in certain configurations.
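The instance-level variant described here could look roughly like the sketch below. `BinaryStorage` and `BinaryStorageProvider` are hypothetical names, the stores are in-memory only, and the "default" namespace name is an assumption. The point is only that the namespace is chosen once, when the storage is requested, and that the namespace-aware lookup can be added later as an overload.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical namespace-free API, as proposed in the thread (byte[] variants only). */
interface BinaryStorage {
    void store(String id, byte[] blob);
    byte[] fetchAsByte(String id); // null if the ID is unknown (assumption)
    void remove(String id);
}

/** Hypothetical provider: the namespace is bound when the storage is requested. */
class BinaryStorageProvider {
    static final String DEFAULT_NAMESPACE = "default"; // assumed name

    private final Map<String, BinaryStorage> instances = new ConcurrentHashMap<>();

    /** Original entry point: clients that do not care get the default instance. */
    BinaryStorage getStorage() {
        return getStorage(DEFAULT_NAMESPACE);
    }

    /** Overload that could be added later without breaking existing clients. */
    BinaryStorage getStorage(String namespace) {
        return instances.computeIfAbsent(namespace, ns -> new InMemoryStorage());
    }

    /** Each namespace gets its own, fully separate in-memory store. */
    private static final class InMemoryStorage implements BinaryStorage {
        private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

        public void store(String id, byte[] blob) { blobs.put(id, blob.clone()); }

        public byte[] fetchAsByte(String id) {
            byte[] blob = blobs.get(id);
            return blob == null ? null : blob.clone();
        }

        public void remove(String id) { blobs.remove(id); }
    }
}
```

In an OSGi setup the lookup would more likely be done through service properties and filters than through a provider class like this one; the sketch only shows the shape of the API from the client's point of view.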

In order to have a properly designed API it is better to think now about current and possible future needs and to take proper measures. IMO this is a lot better than discovering later: "Hm, could and should have seen that before.", and therefore it is in no way a "generic bug". This is particularly true with Eclipse and its policy of non-breaking APIs.

If I understand it correctly, then the namespace for the BinStorage service is configured per service instance. E.g. we would define two services with namespaces "A" and "B". To distinguish between those two on the OSGi level, targets must be used to filter the correct service to use (this can all be configured). It is important that those two instances are really separate (they contain different data), so they cannot be used for failover. For failover, each service instance must therefore be replicated, leading to lots of service instances (this may be a drawback compared to a namespace parameter).

Different instances

> 6. The client component must use different instances of the binary store fully transparently

@Igor: could you make this requirement more precise?

This means that the client must not know anything about the implementation details of the record binary storage service (e.g. whether binary storage stores data in a file system or in a DB).

OK. That part is a given to me. My question aimed more in the direction of "different instances". Do you mean by this multiple instances of the BinStore running concurrently (leaving the case of clustering and its needs aside) or just different instantiations at different times? I suppose from your answers/comments the latter, right?
