COSMOS DC Design Discussions
- Binding service: The binding service instantiates and initializes components. The current design requires exploiters of the framework to provide a binding service for user-defined components. However, the information the binding service needs can be extracted from the assembly XML, so the service can be moved into the framework itself. SampleBindingService and GLABindingService, for example, follow almost the same pattern.
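To make the suggestion concrete, here is a minimal sketch of what a framework-level binding service could look like. All names in it (`Component`, `ComponentDescriptor`, `GenericBindingService`, `EchoSink`) are hypothetical and are not the framework's actual types. It assumes each component element in the assembly XML has already been parsed into a class name plus a map of parameters, and that components have a no-argument constructor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical component contract; not the framework's actual interface. */
interface Component {
    void init(Map<String, String> parameters);
}

/** Hypothetical holder for what would be read from one assembly XML element. */
final class ComponentDescriptor {
    final String className;
    final Map<String, String> parameters;

    ComponentDescriptor(String className, Map<String, String> parameters) {
        this.className = className;
        this.parameters = parameters;
    }
}

/** A single framework-level binding service instead of one per exploiter. */
final class GenericBindingService {
    List<Component> bind(List<ComponentDescriptor> descriptors) {
        List<Component> components = new ArrayList<>();
        for (ComponentDescriptor d : descriptors) {
            try {
                // Instantiate by class name, assuming a no-argument constructor.
                Class<?> type = Class.forName(d.className);
                Component c = (Component) type.getDeclaredConstructor().newInstance();
                // Initialize with the parameters taken from the assembly definition.
                c.init(d.parameters);
                components.add(c);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot bind " + d.className, e);
            }
        }
        return components;
    }
}

/** Toy component used only to exercise the sketch. */
class EchoSink implements Component {
    Map<String, String> received;

    @Override
    public void init(Map<String, String> parameters) {
        this.received = parameters;
    }
}
```

With something along these lines, an exploiter would only write the component class and list it in the assembly XML; no component-specific binding service would be needed.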
- The coupling between components
- The components should be loosely coupled.
- The current framework design uses an event model to pass events/data from component to component in the assembly.
- After each event is processed, the object is dispatched by calling the wire method of the next "component" in the assembly. This is potentially inefficient and makes optimization difficult. For example, when I wrote the CBE data sink, I had to do an insert for each CBE. If I had access to the whole list of CBE objects, I could do a single batch insert instead.
- Question: Do we need to support the use case where the same data source needs to be dispatched to multiple data transformers or data sinks concurrently? The current assembly XML schema design uses a containment structure for components, and allows a data source to pipe data into multiple filters, each of which in turn pipes data into multiple data sinks. (The code example has not exploited this scenario yet.)
<filter> <sink> </sink> </filter>
This is a more generic design, but it brings complexity with it. This document structure also makes it difficult to define an XML schema that describes it. We could adopt a serial design instead, which is more straightforward:
<context> <filter>...</filter> <sink>...</sink> </context>
- wire method: The method name is not part of the interface, probably because each component needs a different name and the type of the parameter object also differs. Instead, the framework assumes a method name, introspects the available methods, and compares their parameters against a list of acceptable object types. I find this process quite complicated and potentially difficult to debug. An alternative is to use a generic method name such as process that takes an array of Objects. It is then up to the implementation of the component to decide how to handle each object.
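The alternative can be sketched as follows. This is only an illustration under stated assumptions: `Processor`, `SampleEvent`, `NonEmptyFilter`, and `BatchingSink` are hypothetical names, and `SampleEvent` merely stands in for a CBE object. Because the whole array arrives in one call, a sink is free to handle it as one batch, which also addresses the per-CBE insert problem mentioned earlier.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical uniform contract replacing the per-component wire methods. */
interface Processor {
    void process(Object[] items);
}

/** Hypothetical stand-in for a CBE object. */
final class SampleEvent {
    final String message;

    SampleEvent(String message) {
        this.message = message;
    }
}

/** Sink that receives a whole batch in one call. */
final class BatchingSink implements Processor {
    final List<List<String>> batches = new ArrayList<>();

    @Override
    public void process(Object[] items) {
        List<String> batch = new ArrayList<>();
        for (Object item : items) {
            // The component itself decides which object types it accepts.
            if (item instanceof SampleEvent) {
                batch.add(((SampleEvent) item).message);
            }
        }
        if (!batch.isEmpty()) {
            // Placeholder for a single batched insert into the data store.
            batches.add(batch);
        }
    }
}

/** Filter that drops empty messages and forwards the rest downstream. */
final class NonEmptyFilter implements Processor {
    private final Processor next;

    NonEmptyFilter(Processor next) {
        this.next = next;
    }

    @Override
    public void process(Object[] items) {
        List<Object> kept = new ArrayList<>();
        for (Object item : items) {
            if (item instanceof SampleEvent && !((SampleEvent) item).message.isEmpty()) {
                kept.add(item);
            }
        }
        next.process(kept.toArray());
    }
}
```

The trade-off is that type checking moves from the framework's introspection step into each component, but a failure then shows up in the component's own code, where it is easier to debug.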
- Separation of concerns: There are three crosscutting concerns in the data collection code:
- assembly definition: the definition of the assembly XML files, and writing code for an assembly component (e.g. a transformer).
- metadata definition: Mapping of legacy data models to the generic metadata specification (SDMX-like model)
- WSDM management features
- Currently, the above three concerns are too intertwined, which makes the framework very difficult to use. The definition of datasource, keyset, etc. should be separated from the assembly XML, and should not be cached in component implementations.
- Query framework: I find the query framework still quite ad hoc. There is no standard query mechanism. The query assembly binds a query name to a class, which implements an application-specific query interface, and the implementation of the query can use any method. I also find the distinction between "inbound" and "outbound" in a data assembly unnecessary. We can provide standard data sources that make SQL queries against a relational database, or XPath queries on XML documents. These kinds of generic "data sources" or "queries" can be used in both inbound and outbound data collection.
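As an illustration of a generic query component, the sketch below shows a data source that answers XPath queries over an XML document using the standard `javax.xml.xpath` API. The class name `XPathDataSource` and its `query` method are hypothetical; the framework does not define them. A SQL-backed data source could follow the same shape, with a query string in and a list of results out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical generic data source that runs XPath queries on an XML document. */
final class XPathDataSource {
    private final String xml;

    XPathDataSource(String xml) {
        this.xml = xml;
    }

    /** Returns the text content of every node selected by the expression. */
    List<String> query(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath expression: " + expression, e);
        }
    }
}
```

The query is plain data rather than an application-specific class, so the same component could be configured from the assembly XML and reused in either direction of data collection.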
- Other things that can be done for data collection:
- Hello world example for using the dc framework
- More useful data sources and sinks:
- GLA datasource: The current example uses an "embedded GLA source". The GLA source can be an external GLA instance.
- Revisit CBE schema and decision regarding interoperability with TPTP.
- A more user-friendly interface (command line or UI) for interacting with the dc framework.