Talk:COSMOS DC Design Discussions
Hubert - thanks for these suggestions. I've got some comments/questions below. Let's do a few iterations and then get some of these into bugzilla.
>>Comments (Joel): From reading all of your comments below, it sounds like you're saying that we should remove everything from the DC runtime that doesn't look like the GLA, and convert the remainder to conform to the way GLA works.
Binding Service: Is the suggestion that we remove the BindingService, or provide a base implementation? The BindingService interface was put into the framework for two purposes. One was to provide a mechanism for allowing component providers to construct, initialize and manage their components using their own extensions to the assembly declarations, and the other was to provide an isolation mechanism for the component classloaders. The Sample and GLA binding services are so similar only because they don't really do much.
I would suggest removing the BindingService. We can require all components to provide a constructor with no parameters. With the factory name specified in the assembly XML, the framework can instantiate the class. If all components need to be registered with the contribution manager, that can be done in the framework code. Any extra information should go in the assembly XML in the binding sections. Registry information should not be handled here. (This is related to the separation of concern comment I had.) I'm not sure what custom mechanism is required. Can you also explain the need for an "isolation mechanism for the component classloaders"? The design principle is to make the framework easy to extend. I envision users simply writing code for some components by implementing the API, or reusing existing components. Then they use the assembly XML to link up the components to form the assembly.
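A minimal sketch of this suggestion, assuming a hypothetical ComponentFactory and that the class name is read out of the assembly XML (the name and shape of this class are illustrative, not existing COSMOS code):

```java
public class ComponentFactory {
    // Instantiate a component via its public no-argument constructor,
    // given the class name taken from the assembly XML binding section.
    public static Object create(String className) throws Exception {
        Class<?> clazz = Class.forName(className);
        return clazz.getDeclaredConstructor().newInstance();
    }
}
```

Anything beyond construction (contribution-manager registration, extra binding parameters) would then be handled by framework code around this call rather than by per-provider binding services.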
>Comments (Joel) I think we're going to disagree on this one. The intent of the binding service was to act as the factory for classes of components. The binding service provides a place to let COSMOS consumers handle concerns like component reuse, class loader isolation and parameterized construction. Relying on createInstance() means that you'll have no way to let a consumer manage their own components, and also require that every component implement code to deal with parsing the binding information from the assembly.xml. I'm sure we could supply a default BindingService implementation that has GLA-like semantics, but I'm not sure I agree with removing the BindingService altogether.
I should mention here that TPTP already has a framework for collecting data, passing it through one or more processing units for transformation/parsing/filtering, and outputting the data to a sink. The framework is part of the Generic Log Adapter (GLA). The GLA framework is very generic, not specific to log analysis, and is a mature piece of code. Most of the COSMOS assembly is a re-invention of this framework. You can find the code in the TPTP CVS (org.eclipse.hyades.logging.adapter). The GLA framework has an XML configuration file like the COSMOS assembly XML. TPTP provides some useful sensors, parsers, formatters and outputters which users can reuse. Users can also provide their own log parsers and hook them up in the XML file.
Component Coupling: Completely agree on the batch import. My initial plan was to add support for array types to the wiring code - just never got around to it. As you surmised, the initial implementation was geared for live data collection, not mass imports after the fact.
Again, I'd like to compare the design with the GLA framework. The GLA data sensor has an attribute called "maximumBlocking" which determines the number of data entries to buffer before passing them on to the next component. GLA also has a multi-threaded implementation which runs each component on a separate thread, allowing concurrent processing by multiple components down the pipeline. I think the buffering mechanism can be useful for both live monitoring and batch imports.
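The buffering idea could be sketched roughly as follows (illustrative only, not actual GLA code; BufferingStage and the way maximumBlocking is applied here are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffer up to maximumBlocking entries before handing the batch to the
// next component in the pipeline; flush() pushes any remainder, e.g. at
// the end of a batch import.
public class BufferingStage {
    private final int maximumBlocking;
    private final Consumer<List<Object>> downstream;
    private final List<Object> buffer = new ArrayList<>();

    public BufferingStage(int maximumBlocking, Consumer<List<Object>> downstream) {
        this.maximumBlocking = maximumBlocking;
        this.downstream = downstream;
    }

    public void accept(Object entry) {
        buffer.add(entry);
        if (buffer.size() >= maximumBlocking) {
            flush();
        }
    }

    public void flush() {
        if (!buffer.isEmpty()) {
            downstream.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

The same stage serves live monitoring (small maximumBlocking, frequent flushes) and batch import (large maximumBlocking, one flush at the end).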
I think the container should be retained. It gives the framework the capability to multiplex component pipelines and supports a richer composition model for deployment. For cases where the source does a destructive read, you can only support composition scenarios by modifying components, which will reduce reuse.
I initially made the comment more from the schema design point of view than from semantics considerations. I didn't like the telescoping structure of the document; it makes the document tree very deep. If you want to add a transformation step, for instance, you need to add the new element as a child of the source element and then make the sink a child of the transform element.
If you want to keep the multiplexing semantics, here is an alternative to the document structure.
<multiplex>
  <channel>
    <filter>...</filter>
    <sink>...</sink>
  </channel>
  <channel>
    <filter>...</filter>
    <sink>...</sink>
  </channel>
</multiplex>
The schema defines a sequence of source, filter, ..., sink. Each element can occur zero or more times. Each channel element can accept the same sequence definition.
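A hypothetical XSD fragment for this flattened structure (element and group names here are illustrative, not an actual COSMOS schema):

```xml
<!-- Repeating sequence of pipeline steps; channel reuses the same group. -->
<xsd:group name="pipelineSteps">
  <xsd:sequence>
    <xsd:element name="source" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="filter" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="sink"   minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:group>
<xsd:element name="channel">
  <xsd:complexType>
    <xsd:group ref="pipelineSteps"/>
  </xsd:complexType>
</xsd:element>
```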
As for the need for multiplexing semantics, I haven't seen any applications of the framework yet, so I'm not sure if it is indeed useful. But I would caution you not to over-engineer the framework.
Wire Method: Yeah - upon reflection (pun intended), it probably would've been better to have the base class discover the wiring methods by introspection. I'd rather not use Object and Object array as type, however, as that means adding complexity to components that support multiple wiring types. Let's see if we can come up with a strongly typed implementation.
I think having a processEvents method in the interface that takes an Object array may not be a bad idea. It is a method to be invoked by the framework, so it is important that it be declared in the interface. If you need type safety, you can force each component to declare an acceptedType and do a type validation before each method call. If the object types don't match up between component outputs and inputs, the assembly will fail one way or another. As long as the framework can fail "gracefully" with an obvious error message, the design is OK.
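A sketch of what that runtime validation could look like; all names here (TypeCheckedDispatch, getAcceptedType) are assumptions, not existing COSMOS API:

```java
public class TypeCheckedDispatch {
    // A component declares the element type it accepts; the framework
    // validates each batch before invoking processEvents.
    public interface ProcessingComponent {
        Class<?> getAcceptedType();
        void processEvents(Object[] events);
    }

    // Fail fast with an obvious message if the upstream output type does
    // not match what the downstream component accepts.
    public static void dispatch(ProcessingComponent c, Object[] events) {
        for (Object e : events) {
            if (e != null && !c.getAcceptedType().isInstance(e)) {
                throw new IllegalArgumentException(
                    "Component expects " + c.getAcceptedType().getName()
                    + " but received " + e.getClass().getName());
            }
        }
        c.processEvents(events);
    }
}
```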
>Comments (Joel) If we choose a strongly-typed, naming convention based approach, then we can create tooling that can use reflection at authoring time to validate assembly specifications and remove the need to do type validation in the processing pipelines. If we use Object as the parameter type, then we will require component authors to do validation the framework could/should do for them, and we won't catch errors until deployment time.
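The authoring-time alternative could be sketched as below, under an assumed naming convention (a single-argument "process" method); none of these names come from the actual codebase, and StringSink is only an example component:

```java
import java.lang.reflect.Method;

// Discover a component's wiring type by introspection so that tooling can
// validate assembly specifications before deployment.
public class WiringIntrospector {
    // Return the parameter type of the component's process(T) method,
    // or null if the convention is not followed.
    public static Class<?> acceptedType(Class<?> componentClass) {
        for (Method m : componentClass.getMethods()) {
            if (m.getName().equals("process") && m.getParameterTypes().length == 1) {
                return m.getParameterTypes()[0];
            }
        }
        return null;
    }

    // True if a producer emitting producesType can be wired to the consumer.
    public static boolean compatible(Class<?> producesType, Class<?> consumerClass) {
        Class<?> accepted = acceptedType(consumerClass);
        return accepted != null && accepted.isAssignableFrom(producesType);
    }

    // Example component used only for illustration.
    public static class StringSink {
        public void process(String s) { }
    }
}
```

Running such checks while authoring the assembly XML catches type mismatches without any Object[] casting in the pipeline itself.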
Separation of concern: Can you give some specifics here? I'm not sure I agree about removing all traces of Keyset/DataType from the assembly. They were put there explicitly to allow for linkages back to our SML work (or forward into any SDMX work we may do). I agree that they probably shouldn't show up in the components - that kind of information should be mediated by the binding services. As for WSDM, I think we've already got an enhancement request to support muse-generated management interfaces for components.
By "separation of concern", I mean that the COSMOS framework should be componentized such that developers who are using the COSMOS framework only need to focus on solving one problem at a time. For example, when I am developing a source component to get statistical data from a web server, or a sink to put statistical data into a database, I don't need to know about SDMX or the data registry. In the current implementation, there is code for handling SDMX data everywhere - in the assembly XML, the binding service and the component implementations. I think that's not necessary.

To use the COSMOS framework, we do need a step to map legacy data structures to the SDMX model. This can be done somewhat like the iBatis map file, which maps Java objects to relational table columns. You can put the mapping info in the assembly XML or in a different XML file. The dataSource, keyset and key definitions can be put under the same element in the XML.

How do we map data to the SDMX model? It may not be easy for some data structures (such as hierarchical and graph-like data). The current query interface in the end-to-end demo, and the ones proposed by CA, make heavy use of the SDMX concepts. For example, you query for data using dataset, keyset and keys. If I have legacy data that was not collected using COSMOS, how can I map it to SDMX data? If I use the API (IDataQueryService) that lets me pick the query dialect and provide a query string in that dialect, would that mean SDMX will become one of the query dialects? I'm still unclear about how SDMX will be used in COSMOS. The proposal that CA is coming up with will help clear up some of these questions.
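A hypothetical mapping fragment in the iBatis-like spirit described above; every element and attribute name here is illustrative, not part of any existing COSMOS or SDMX schema:

```xml
<!-- Relate legacy columns to keyset/key names in one place, so component
     code never needs to know about the registry or SDMX concepts. -->
<dataMapping dataSource="webServerStats">
  <keyset name="responseTimes">
    <key name="hostName"   legacyColumn="HOST"/>
    <key name="avgLatency" legacyColumn="AVG_MS"/>
  </keyset>
</dataMapping>
```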
>Comments (Joel): Actually, we don't have any SDMX data in COSMOS. All we've done is take the concepts. For instance, the notion of "Type" and "Instance" for a source of data is something that the author of a data source component should be interested in. The notion of "Type" and "Instance" for a set of persisted data is something that the author of a data sink component should be interested in. We've tried to make these concepts opaque within DC (so that they can be related to a CMDB repository OR a true SDMX-compliant repository). Also, we've only applied them to our demo components and their convenience APIs. There's no trace of keysets or dataflow types in our query API.
Query framework: I think we have this already. It's called the IDataQueryService. The JPA example implemented this for the EJBQL dialect; however, our move to iBatis, the demand for those 'convenience' APIs, and our short timeline for delivering an End 2 End demo meant that adding support for an SQL dialect didn't make it into the iteration. I think adding generic support for SQL and XPath makes a ton of sense.
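Dialect-based dispatch in the spirit of IDataQueryService could be sketched as follows; the class and method names below are assumptions for illustration, not the actual interface:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Route a query string to a handler registered per dialect (e.g. "SQL",
// "XPath", "EJBQL"), failing with a clear message for unknown dialects.
public class DialectQueryService {
    public interface DialectHandler {
        List<Object[]> execute(String queryString);
    }

    private final Map<String, DialectHandler> handlers = new HashMap<>();

    public void register(String dialect, DialectHandler handler) {
        handlers.put(dialect, handler);
    }

    public Set<String> getSupportedDialects() {
        return handlers.keySet();
    }

    public List<Object[]> executeQuery(String dialect, String queryString) {
        DialectHandler h = handlers.get(dialect);
        if (h == null) {
            throw new IllegalArgumentException("Unsupported dialect: " + dialect);
        }
        return h.execute(queryString);
    }
}
```

Adding SQL or XPath support then means registering one more handler rather than changing the service interface.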
Inbound/Outbound: I think I agree, this distinction could be made at runtime when the assembly is parsed, just by looking at the outer container (source == inbound, query == outbound). The concept (at least at runtime) is valid, as the two types of assemblies may present different control interfaces.
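The parse-time classification described above could be as simple as this (names are illustrative):

```java
// Classify an assembly from its outer container element, per the
// suggestion that source => inbound and query => outbound.
public class AssemblyClassifier {
    public enum Direction { INBOUND, OUTBOUND }

    public static Direction classify(String outerElementName) {
        switch (outerElementName) {
            case "source": return Direction.INBOUND;
            case "query":  return Direction.OUTBOUND;
            default: throw new IllegalArgumentException(
                "Unknown outer element: " + outerElementName);
        }
    }
}
```

The runtime could then attach the appropriate control interface based on the returned direction.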
Comments (Hubert): To me, the word "query" can be substituted with "source". For a SQL query, the data source is the relational database, and the query string is a parameter for the source component.
Other Things: More useful sources and sinks: Yes. I'd like to see the GLA source hooked up to live data as well. What impact will that have on your batch update request?
My comment about buffering applies to both live data and batch import.
Other Things: User friendly interface: We've got remote clients written (and you can always use MAX). Should we try to get Sheldon to help us make a web-based management app?
Can you give us a demo on how to use MAX to invoke the remote interfaces? I still don't know how to run the part of the code that deals with WSDM capabilities and notifications.
We may be able to have a more COSMOS-specific management user interface that, for example, lets users browse through available brokers and assemblies, start/stop assemblies, inspect metadata, etc.
Other Things: Doc: Oh yes.