Talk:COSMOS Design 197867
(Hubert's comments are within the <Hubert's comment>...</Hubert's comment> markers)
First, some general comments. One of the nice things that SDMX does is make a distinction between the structure of a set of data and the source of the set of data. If you take a look at the SDMX DataFlow structure, you'll see that it breaks down into two distinct concerns. The first is the type of data (the keyset/dimension concern), and the second is the type of entity that created the data (the source concern). It makes tons of sense to me to see information related to the first concern being pushed up from DC into CMDB land, and the information related to the second concern being pushed down from CMDB into DC land. The nice thing about supporting this type of symmetry is that you can use the results of a CMDBf graph query as input to a DataBroker graph query.
<Hubert's comment> Joel, can you explain why the data structure specification flows from DC to CMDB? The data directions of the two concerns are not very intuitive to me. To me, all metadata information comes from data managers or MDRs. The first concern (data structure) is provided to the broker (federating CMDB) by data managers (MDRs) when they register.
<Joel's response>Sure. The Data Broker is the real 'owner' of the DC keyset information. If the DataBroker acts as an MDR for the purpose of supporting CMDB federation, it makes sense to me that the bit of metadata that's being federated out of the DataBroker MDR is this keyset information. The CMDB shouldn't have to know about all of the DataManagers - let the DataBroker act as the aggregator.</Joel's response> </Hubert's comment>
- How does the query work? Need better example of how to query for data.
- Need example of how to map existing data to SDMX structure.
- SDMX DataSet definition - DataFlow URI + DataSource URI + Time period
- SDMX DataFlow definition - DataSource URI + Keyset URI
- SDMX DataSource definition - DataSourceURI + DataSourceTypeURI
- SDMX DataSourceType definition - DataSourceTypeURI
- SDMX Keyset definition - Keyset URI + a list of DimensionURIs
- SDMX Dimension definition - DimensionURI + Type enumeration. SDMX restricts the type of information that a dimension may hold to enumerations and simple scalar types.
- We've completely ignored the SDMX 'concept' construct, as it has a high degree of overlap with our intended use of SML/CML.
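The structural definitions above can be sketched as plain data classes. This is a minimal illustration of the relationships between the definitions, not the SDMX schema itself; all class and field names here are our own shorthand:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class DimensionType(Enum):
    # SDMX restricts dimensions to enumerations and simple scalar types
    ENUMERATION = "enumeration"
    SCALAR = "scalar"

@dataclass
class Dimension:
    uri: str
    type: DimensionType

@dataclass
class Keyset:
    uri: str
    dimensions: List[Dimension] = field(default_factory=list)

@dataclass
class DataSourceType:
    uri: str

@dataclass
class DataSource:
    uri: str
    type_uri: str          # DataSourceType URI

@dataclass
class DataFlow:
    source_uri: str        # DataSource URI
    keyset_uri: str        # Keyset URI

@dataclass
class DataSet:
    flow_uri: str          # DataFlow URI
    source_uri: str        # DataSource URI
    time_period: str       # e.g. an ISO 8601 interval, or 'for all time'
```

Note that a DataSet is identified by the combination of flow, source, and time period, which is what makes the "one dataset for all time" simplification discussed below possible.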
<Hubert's comment> Regarding mapping existing data to SDMX, we need to take care of the use case where I have an existing database of system management data (stat data, log data, or hardware config data). The data is collected by existing applications, such as TPTP or IBM Tivoli products like ITM. In these cases, we need a generic way to map the data models of these existing databases to the SDMX data model. In many cases, the mapping is non-trivial. We will also need extension code to translate SDMX queries into native query languages, and transformers to convert native data structures to the SDMX data formats (e.g. from a JDBC result set to key families and key concepts). This extra mapping and processing logic may become a performance issue.
<Joel's Response>This is only an issue if we make it one. If the existing dataset just provides a description of its record format in 'SDMX' form, then there's nothing to map. Also, our assemblies support Transformers, so we can always deal with this issue at the component level. </Joel's Response> </Hubert's comment>
I think that most of our mapping issues can be solved by deciding what time periods we want to support (for a lot of cases it may be 'for all time', as a means of effectively saying that there is only one dataset), and at what level we want the URIs to actually point to more granular definitions. For example, if we say that a WEF event's URI is a Keyset URI instance, and that the WEF event's URI is terminating (in other words, doesn't have a corresponding set of Dimensions), then we can effectively map our type information into SDMX by truncation. All that's required is to relax the schema requirements on the higher-level constructs in SDMX to be 'could have' instead of 'must have'.
<Hubert's comment> I don't like the idea of truncating the SDMX model at a point beyond which it can't be used efficiently - for example, representing WEF events up to the keyset and not mapping the attributes of WEF events to keys. We would simply be saying that SDMX is not suitable for all data types. It won't be very meaningful to just map a WEF event to a keyset, because we can't form an SDMX query from the contents of the event attribute values, which are very important in forming the search criteria of a query (e.g. get all events with high severity)
<Joel's Response> So map the bits of WEF that make sense into a keyset, and truncate the rest. There's no reason to take an extremist stance here. Of course, you could also (once you've determined that the data format is WEF) use the convenience API that would be surfaced as a MUSE capability on the DataManager endpoint to do your more granular query. Remember, we have the capability model to exploit - there's no reason to force-fit anything here. </Joel's Response> </Hubert's comment>
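The "map the bits that make sense and truncate the rest" idea can be sketched as follows. The attribute names here are hypothetical, and the flat attribute map is an assumption about how a WEF event would be surfaced; the point is only that a fixed subset of attributes becomes dimensions and everything else stays behind the DataManager's own API:

```python
# Hypothetical set of WEF event attributes worth exposing as SDMX-style
# dimensions (queryable keys). Anything else is truncated: it remains
# reachable only through the DataManager's more granular capability.
QUERYABLE_ATTRIBUTES = {"severity", "reporterComponentId", "creationTime"}

def truncate_to_keyset(event_uri: str, event_attributes: dict) -> dict:
    """Map a WEF event into a keyset instance by truncation."""
    dimensions = {name: value
                  for name, value in event_attributes.items()
                  if name in QUERYABLE_ATTRIBUTES}
    # The event's URI itself serves as the Keyset URI instance; attributes
    # not captured as dimensions are simply invisible to SDMX-style queries.
    return {"keyset_uri": event_uri, "dimensions": dimensions}
```

Under this sketch, Hubert's "get all events with high severity" query remains expressible (severity is a dimension), while free-form message text, for example, is not.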
- Have we agreed that SDMX is the default model we support? Are we talking about query models? If so, then no - I don't think that SDMX is appropriate as our default, or at least given the proposed mapping solution outlined above, we won't be able to support the full SDMX query interface. Of course, providing limited support for SDMX queries as a WSDM capability on the DataBroker should be feasible.
- What about other models, e.g. SQL?
- Does the user NEED to understand SDMX to use this? It would be helpful to understand the spirit of SDMX, at least enough to know what an appropriate mapping would be.
<Hubert's comment> I would suggest that we do not mention the term "SDMX" in any of the COSMOS documentation, even if we are borrowing some concepts from it. Here are the reasons:
- The SDMX concepts can be hidden behind some easy-to-use APIs.
- It's a partial implementation (in fact a very small subset). People who really know about SDMX will say we are not using SDMX.
- Most people have never seen the term "SDMX" in their lives, and let's not say they need to learn something new to use COSMOS. It will hurt adoption.
- How does the design support models other than SDMX? I'd vote for using the WSDM capability model, so you can always support additional requirements through composition.
- How do we reconcile the CMDBf query structure? This is where we can start to do some interesting things. For example, CMDBf has the concept of a 'Graph Query', which allows you to do pattern-oriented queries against the CMDB structure. This is really handy, because what we have in DC (and SDMX, for that matter) is pretty much unaware of structural implications - particularly at the instance level. So imagine that we use a CMDBf query against a compliant CMDB to identify things that are "close to" the problem we're trying to track down (and we can imagine one or more definitions of 'close to' based on pluggable heuristics or norms). If we've done a proper job of mapping our SDMX-like URIs, then we are well set up to support a DataBroker query that can provide a set of EPRs that have information "close to" what's needed for problem resolution. So if we implement the DataBroker as an MDR (which means supporting the CMDBf Query API), we can use the implementation to return the set of DataManager EPRs as Target Items in the GraphQueryResponse structure. I think that's pretty cool!
In order to close on the broker design, we need more clarity on the following areas:
- Content of the registry: How the data manager EPRs are indexed in the registry.
- How will the data broker expose the registry contents to clients? (Broker API for clients)
- How do data managers register with the data broker? (Broker API for data managers)
Please see my comments at Talk:COSMOS_Design_197870 on broker APIs. Please include the broker API in the design doc of 197867 because I think they are part of the broker design.
If we are looking at a simpler broker registry design for i6 that doesn't take CMDBf and other standards into account, I have the following proposal for consideration. (When we come up with a broker implementation that supports CMDBf query and a more elaborate index system for data managers or managed resources, we can replace the broker component with the newer design.)
The registry will be a simple lookup table of EPRs indexed by keywords.
The client will look up data managers by keyword. The broker will return all EPRs associated with the keyword the client provided in the request. Data managers will provide a keyword that describes the data they provide during registration. This design keeps the EPR query and data manager registration APIs very simple.
This approach does not use SDMX data model at all. While I don't oppose the use of SDMX data model in the broker, SDMX will make the API and process for data manager registration and EPR queries a lot more complicated. I just want to show a very simple design here for consideration, which I think is in line with the proposed API in design doc of 197870.
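The lookup table described above can be sketched in a few lines. This is only an illustration of the proposed data structure (a map from classification keyword to a set of EPRs), not an implementation of the broker itself; EPRs are shown as plain strings for simplicity:

```python
from collections import defaultdict

class SimpleBrokerRegistry:
    """A lookup table of data manager EPRs indexed by classification keyword."""

    def __init__(self):
        # keyword -> set of EPRs for data managers registered under it
        self._eprs_by_keyword = defaultdict(set)

    def register(self, keyword: str, epr: str) -> None:
        self._eprs_by_keyword[keyword].add(epr)

    def deregister(self, keyword: str, epr: str) -> None:
        self._eprs_by_keyword[keyword].discard(epr)

    def lookup(self, keyword: str) -> set:
        # Return all EPRs associated with the keyword (empty set if none)
        return set(self._eprs_by_keyword.get(keyword, set()))
```

A query is then a single dictionary lookup, which is what makes both the registration and the EPR query APIs trivial compared to an SDMX-based index.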
(These use cases are taken from a presentation I prepared for last Friday's meeting. We didn't get to this part of the presentation in that meeting. I have further simplified the use cases below.)
- Broker wakes up.
- Broker undergoes initialization: loops through the entries in the registry and pings each data manager for its status. If a data manager is not reachable, the broker removes the entry from the registry.
- Broker waits for requests from client or data managers
Data manager registers with broker
- Data manager wakes up.
- Data manager contacts broker via well-known (pre-configured) EPR.
- Data manager invokes the "registration" capability of the broker, providing the following information (parameters) during registration:
- an identifier (ID) that identifies the data manager
- an EPR for clients to contact the data manager
- a classification keyword (e.g. hardware configuration, statistical data, etc.). This keyword identifies the type of data that the data manager can provide, and will be used by the client when querying the broker for relevant data managers. (The broker keeps a dictionary of classification keywords and an API for clients and data managers to look it up. It's analogous to "topics" in the pub/sub paradigm.)
- Broker adds an entry in the registry for the new data manager.
Data manager deregisters with broker
- Data manager invokes the deregister capability of the broker, providing ID as a parameter.
- Broker removes the entry that corresponds to this data manager from the registry
Client queries broker for data managers
- Client invokes an API of the broker, providing a classification keyword (e.g. hardware configuration)
- Broker retrieves EPRs of data managers from the registry that are under the desired category and returns the EPRs to the client
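The three use cases above (register, deregister, query) can be sketched as one small broker API. The method names and the ID-keyed registry are assumptions for illustration; each entry records the (ID, EPR, keyword) triple that registration provides:

```python
class Broker:
    """Sketch of the broker's registration and query capabilities."""

    def __init__(self):
        # data manager ID -> (EPR, classification keyword)
        self._registry = {}

    def register(self, manager_id: str, epr: str, keyword: str) -> None:
        # "Data manager registers with broker": store its EPR under its ID
        self._registry[manager_id] = (epr, keyword)

    def deregister(self, manager_id: str) -> None:
        # "Data manager deregisters with broker": ID is the only parameter
        self._registry.pop(manager_id, None)

    def query(self, keyword: str) -> list:
        # "Client queries broker for data managers": return matching EPRs
        return [epr for (epr, kw) in self._registry.values() if kw == keyword]
```

Keying the registry by manager ID means deregistration needs only the ID, exactly as in the deregister use case above.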
<Joel's Comments> I've tried to flesh out these interactions a bit with some options:
Case 1: Given a reported error on a resource, resolve a relevant set of DataManagers for that resource.

Step 1: Determine what additional resources are relevant to the reporting resource
 Client -------------------------------> CMDB (CMDB Query capability)
 Graph<CMDB Resource Identifiers> <----- graphQuery()

Step 2: Get a list of data brokers (using either 2A or 2B)

Step 2A: Get the list of known DataBrokers from the Domain Manager.
 Client -------------------------------> Domain Manager resolution (locate the well-known EPR for the domain manager: hard code/local config/LDAP/whatever...)
 Client -------------------------------> Domain Manager (WSDM)
 List<Capabilities> <------------------- getManagementCapabilities()
 Client resolves COSMOS Domain Manager capability
 Client -------------------------------> Domain Manager (COSMOS Domain Manager capability)
 List<DataBroker> <--------------------- getDataBrokerList()

Step 2B: Get a subset of DataBrokers from the Domain Manager based on some sort of query.
 Client -------------------------------> Domain Manager resolution (locate the well-known EPR for the domain manager: hard code/local config/LDAP/whatever...)
 Client resolves Resource Catalog capability
 Client -------------------------------> Domain Manager (WSDM)
 List<Capabilities> <------------------- getManagementCapabilities()
 Client resolves WS-RC capability
 Client resolves WS-ResourceTransfer capability
 Client -------------------------------> Domain Manager (WS-ResourceTransfer capability)
 Get response<DataBroker EPRs> <-------- get(Get message with XPath filter)

Step 3: Query the DataBrokers for datasets relevant to the reported error (based on time and the contents of the CMDB graphQuery response) using either 3A or 3B.

Step 3A: Query the DataBrokers for datasets relevant to the reported error using the COSMOS provider capability
 Client, for each DataBroker:
 Client -------------------------------> DataBroker (WSDM)
 List<Capabilities> <------------------- getManagementCapabilities()
 Client resolves COSMOS provider capability
 Client, for each CMDB Resource in the graphQuery response:
 Client -------------------------------> DataBroker (COSMOS provider capability)
 List<DataSet> <------------------------ getDataSetsForSource()
 Client -------------------------------> DataBroker (COSMOS provider capability + merged endpoint support)
 List<DataManager EPR> <---------------- getEPRForDataSets(List<DataSet>)

Step 3B: Query the DataBrokers for datasets relevant to the reported error using the CMDBf Query capability
 Client, for each DataBroker:
 Client -------------------------------> DataBroker (WSDM)
 List<Capabilities> <------------------- getManagementCapabilities()
 Client resolves CMDB Query capability
 Client constructs CMDBf graphQuery request based on the contents of the CMDB graphQuery response
 Client -------------------------------> DataBroker (CMDB Query capability)
 Graph<DataSet/EPR pairs> <------------- graphQuery()
I agree with Hubert that we should start with a simple Broker registry. I don't think that we should keep any keyset or device information in the Broker registry because it will introduce a synchronization problem between the Broker and the Data Managers, especially those with large and volatile data sources.
Each entry in the Broker registry should contain this information about a Data Manager:
- Resource ID of the EPR
- classification keyword
- timestamp of last contact with the Data Manager
The client can query the Data Broker for Data Managers that match the classification keyword and the dialect. The client then uses the returned EPR (plus the Resource ID) to query the Data Manager endpoint directly in the dialect that they both understand.
I think that the DC framework should not dictate any specific dialect. Rather, it should be flexible enough to allow clients and managers to be loosely coupled using an agreed-upon dialect. The result is that a client that understands SDMX can locate a manager that understands SDMX (and the same for SML, CML, CMDBf, etc.).
The Broker also maintains the timestamp of the last time that it communicated with the Data Manager. I don't think that the Broker initialization use case should ping all the Data Managers and remove the ones that cannot respond. Rather, the data managers should be responsible for contacting the Broker, which will update the timestamp of last contact.
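The timestamp-based alternative to ping-at-startup can be sketched as follows. The expiry window is an assumed configuration value; the idea is simply that each contact from a data manager refreshes its timestamp, and stale entries are swept out rather than pinged:

```python
import time

class HeartbeatRegistry:
    """Track last contact per data manager; expire stale entries lazily."""

    STALE_AFTER_SECONDS = 300.0   # assumed expiry window, configurable

    def __init__(self):
        # data manager ID -> timestamp of last contact with the broker
        self._last_contact = {}

    def touch(self, manager_id: str, now: float = None) -> None:
        # Called whenever a data manager contacts the broker (registration,
        # re-registration, or any other request): refresh its timestamp.
        self._last_contact[manager_id] = time.time() if now is None else now

    def expire_stale(self, now: float = None) -> list:
        # Remove entries whose last contact is older than the window,
        # instead of pinging every data manager at broker startup.
        now = time.time() if now is None else now
        stale = [mid for mid, ts in self._last_contact.items()
                 if now - ts > self.STALE_AFTER_SECONDS]
        for mid in stale:
            del self._last_contact[mid]
        return stale
```

This inverts the responsibility as suggested above: data managers keep themselves alive in the registry by contacting the broker, and the broker never has to reach out to them.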
FYI: 8/17/2007 I have updated http://wiki.eclipse.org/COSMOS_Design_197867