COSMOS Design 209227
Comments and discussion on the talk page
This document defines how exceptions and errors are handled consistently throughout the COSMOS project. It is expected that COSMOS will have different adoption patterns. For example, the Management Data Repository framework does not require the COSMOS user interface. Therefore, it is important for the logging and exception strategy of COSMOS to be compatible with existing management infrastructure. Because COSMOS does not have any existing exception or logging facilities, it makes sense to look towards existing standards for guidelines and ideas.
In general, we want to make using and extending COSMOS as easy as possible. We often work with services that are distributed throughout the network, e.g. Data Managers. As such, there are a number of places where exceptions become important, namely the boundaries between these services.
In the figure above, there are three significant boundaries. Number 1 represents the boundary between a consuming application and the client that COSMOS provides for the data managers. Between the application and the client, a COSMOS exception will be thrown. Since this is a subclass of java.lang.Exception, the consuming application will be able to catch either this specific exception or the more generic superclass.
Number 2 is the network boundary where web service calls flow. In general, when adopters of COSMOS use the code, the web services infrastructure should not be revealed. Therefore, we do not want them to be forced to work with XML, SOAP Envelopes/Messages, etc. As a result, the Data Manager Client is responsible for deserializing the incoming SOAP message and producing the proper COSMOS exception. Note that it is possible for another non-COSMOS client to be used. In this case, the API contract would be the SOAP Fault.
Number 3 is the boundary from the data manager to the existing store where management data is persisted. This is the API that the Data Manager invokes to retrieve information. For example, when using the Management Data Repository API specified by the CMDBf specification, the MDR would convert the incoming SOAP request containing the CMDBf query to the native API (and/or query). When a problem occurs, the Data Manager captures it and creates a COSMOS exception. If the call was remote, the exception is then trapped by the WS Adapter and surfaced as a SOAP Fault.
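The flow across the three boundaries can be sketched in plain Java. This is only an illustration under stated assumptions: the class names DataManager, WsAdapter and DataManagerClient are hypothetical, CosmosException is reduced to a minimal stand-in, and a Map stands in for the SOAP Fault so that the sketch does not depend on any web services library.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for the COSMOS root exception (assumed shape).
class CosmosException extends Exception {
    CosmosException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Boundary 3 (hypothetical class): wraps failures of the native store API.
class DataManager {
    String query(String request) throws CosmosException {
        try {
            return nativeLookup(request);
        } catch (IllegalStateException e) {
            // A problem in the native store becomes a COSMOS exception.
            throw new CosmosException("Query failed: " + request, e);
        }
    }

    // Placeholder for the native API of the existing store.
    private String nativeLookup(String request) {
        if (request.isEmpty()) {
            throw new IllegalStateException("store rejected empty query");
        }
        return "result-for-" + request;
    }
}

// Boundary 2, server side (hypothetical class): traps the exception and
// surfaces it as a fault. The Map is a stand-in for a real SOAP Fault.
class WsAdapter {
    private final DataManager dataManager = new DataManager();

    Map<String, String> handle(String request) {
        Map<String, String> response = new HashMap<>();
        try {
            response.put("body", dataManager.query(request));
        } catch (CosmosException e) {
            response.put("faultstring", e.getMessage());
        }
        return response;
    }
}

// Boundary 2, client side, and boundary 1 (hypothetical class): turns the
// fault back into a COSMOS exception so the application never sees the wire format.
class DataManagerClient {
    private final WsAdapter remote = new WsAdapter();

    String query(String request) throws CosmosException {
        Map<String, String> response = remote.handle(request);
        if (response.containsKey("faultstring")) {
            throw new CosmosException(response.get("faultstring"), null);
        }
        return response.get("body");
    }
}

class BoundaryDemo {
    public static void main(String[] args) {
        DataManagerClient client = new DataManagerClient();
        try {
            client.query("");
        } catch (CosmosException e) {
            // The application could equally catch java.lang.Exception here.
            System.out.println("Caught: " + e.getMessage());
        }
    }
}
```

A non-COSMOS client would stop at the fault returned by the adapter, which is then the API contract.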
The OASIS Web Services Distributed Management (WSDM) specification defines a general purpose event format and a set of situations that will be used as the logging and event structure for COSMOS. The WSDM Event Format (WEF) is derived from the Common Base Event (CBE) structure found in the TPTP project. In fact, CBE was the initial submission to OASIS and served as the starting point for WEF.
One of the goals of COSMOS is consistency between the exceptions in the code, the entries that are logged, and the management events that are raised. Consistency between the logs and exceptions brings the management aspects closer to the point of origin in the code and improves manageability. A key aspect of this consistency is situations, and more specifically, the situation category.
The WSDM specification defines a “situation” element that helps classify an event. The situations were derived by a “thorough analysis of event types” [MUWS, Part 2, 2.5.1]. Situations allow another dimension of classification for events and facilitate consistent analysis across heterogeneous components, including COSMOS. All COSMOS developers should be familiar with the WSDM specification, specifically WSDM 1.1, MUWS Part 1, section 4, as well as WSDM 1.1, MUWS Part 2, section 2.5 and Appendix F. These sections define the situation format and present guidelines for its usage.
- [Is the burden of understanding situations too much for the developer? They would need to know which situation applies.]
Not all fields defined by the WSDM specification for the situation type are necessary for exceptions. For example, because exceptions are thrown when bad things happen, SuccessDisposition will always be “Unsuccessful”. Likewise, the “Message” field is logically the same as the private variable “detailMessage” that a Java Exception inherits from java.lang.Throwable. Where COSMOS can take guidance from the WSDM specification is by creating a common exception class that captures the extra detail that can be placed directly into the logs as part of the situation. The additional fields are:
- SituationCategory (required)
- SituationTime (optional, defaults to System time)
- Priority (optional)
- Severity (optional)
There will be a root level exception defined as part of COSMOS (org.eclipse.cosmos.common.exceptions.CosmosException). This will subclass java.lang.Exception and define protected variables for each of the four additional fields defined above. An enumerated list of values will be provided for SituationCategory and Severity. Exceptions are considered part of the COSMOS API and will conform to the API guidelines specified by Eclipse.
The use of the additional fields added by COSMOSException is strongly encouraged throughout the project. However, there are circumstances within the code where it may be difficult to determine additional classification via situation. Thus, the use of these additional fields is optional. Further, where third party users are extending the framework, we will encourage, but not require, the adoption of these fields.
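A minimal sketch of such a root exception is shown below. It is not the actual COSMOS source: the enum constants are an illustrative subset loosely modeled on the WSDM situation categories and severities (MUWS Part 2 is the authoritative list), and the constructors, the with... methods and the getters are assumptions made for this example.

```java
import java.util.Date;

// Illustrative subset only; the authoritative values are defined by WSDM MUWS Part 2.
enum SituationCategory { AVAILABILITY, CONNECT, REQUEST, OTHER }

// Assumed severity levels for illustration.
enum Severity { UNKNOWN, INFORMATION, WARNING, MINOR, MAJOR, CRITICAL, FATAL }

class CosmosException extends Exception {
    // The four additional fields, held as protected variables.
    protected SituationCategory situationCategory; // may be null when no classification is known
    protected Date situationTime;                  // defaults to the system time
    protected Integer priority;                    // optional, null when unset
    protected Severity severity;                   // optional, null when unset

    CosmosException(String message) {
        this(message, null, null);
    }

    CosmosException(String message, Throwable cause, SituationCategory category) {
        super(message, cause);
        this.situationCategory = category;
        this.situationTime = new Date();
    }

    // Hypothetical fluent setters for the optional fields.
    CosmosException withSeverity(Severity severity) {
        this.severity = severity;
        return this;
    }

    CosmosException withPriority(int priority) {
        this.priority = priority;
        return this;
    }

    SituationCategory getSituationCategory() { return situationCategory; }
    Date getSituationTime() { return new Date(situationTime.getTime()); }
    Integer getPriority() { return priority; }
    Severity getSeverity() { return severity; }
}

class CosmosExceptionDemo {
    public static void main(String[] args) {
        try {
            throw new CosmosException("Repository unreachable", null, SituationCategory.AVAILABILITY)
                    .withSeverity(Severity.CRITICAL);
        } catch (CosmosException e) {
            System.out.println(e.getSituationCategory() + " / " + e.getSeverity() + ": " + e.getMessage());
        }
    }
}
```

Code that cannot determine a classification can still use the single-argument constructor, which leaves the situation fields unset apart from the time.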
The main exception class in COSMOS is: org.eclipse.cosmos.common.COSMOSException. It is the intent of the COSMOS project to keep the exception hierarchy shallow, and introduce child exceptions only when necessary.
One situation where it is necessary to introduce a new subclass of exception is when a management standard defines a set of faults. An example of this is the CMDBf specification. In these circumstances, it is appropriate to map a Java exception onto the fault defined by the specification. For an example, please reference org.eclipse.cosmos.dc.mdr.exception.CMDBfException.
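As a rough illustration of such a mapping, the sketch below derives a fault-specific exception from a minimal stand-in for the root exception. The shape of CMDBfException here is an assumption and does not reproduce the actual COSMOS class, QueryService is hypothetical, and the fault name string is a placeholder that should be checked against the CMDBf specification.

```java
// Minimal stand-in for the COSMOS root exception (assumed shape).
class CosmosException extends Exception {
    CosmosException(String message) {
        super(message);
    }
}

// Hypothetical child exception that carries the name of a specification-defined fault.
class CMDBfException extends CosmosException {
    private final String faultName;

    CMDBfException(String faultName, String message) {
        super(message);
        this.faultName = faultName;
    }

    String getFaultName() {
        return faultName;
    }
}

// Hypothetical service used only to show where the exception is raised.
class QueryService {
    void validate(String templateId) throws CMDBfException {
        if (templateId == null || templateId.isEmpty()) {
            // "QueryError" is a placeholder fault name for this sketch.
            throw new CMDBfException("QueryError", "Query template ID is missing");
        }
    }
}

class CmdbfDemo {
    public static void main(String[] args) {
        try {
            new QueryService().validate("");
        } catch (CMDBfException e) {
            System.out.println("Fault " + e.getFaultName() + ": " + e.getMessage());
        } catch (CosmosException e) {
            // Callers that do not care about CMDBf can catch the root type instead.
            System.out.println("COSMOS error: " + e.getMessage());
        }
    }
}
```

Keeping the fault name as data on one subclass, rather than adding a class per fault, is one way to keep the hierarchy shallow.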
Logging an Exception
- Joel volunteered to take first pass here.
- Some initial thoughts....
All WSDM faults implemented in Muse derive from BaseFault. BaseFault contains an 'origin' field for designating the party responsible for raising the fault. We may want to add this as an optional field in our CosmosException base class. When it is not present, a value can be inferred from the current handler of an inbound request (this will require a bit of work to support, but is doable).
The current CosmosFault implementation appears to be situated somewhere between a WEF event and a BaseFault. It contains situation info (WEF), but has none of the referential fields (like Reporter, etc.).
The relationship between exceptions and logging: exceptions are created and thrown at the origin of the problem. Logging is (usually) done at the receiving end of the exception, e.g. in a catch block. Logging can also be done in situations where no exception occurs. Logs are used for debugging and tracing purposes, whereas exceptions are programming constructs for handling and recovering from error situations.
We appear to be settling on Log4j as an implementation, which is reasonable considering our relationship to Apache Muse. Log4j supports a logging delegation model based on appenders. By implementing our own appender for logged COSMOS events, we can do things like advertise the existence of a log file or forward exceptions as WEF events to an external listener.
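The division of labor between throwing and logging can be sketched as follows. The example uses java.util.logging only to stay free of external dependencies; it is a stand-in for whichever logging implementation COSMOS adopts. InventoryReader and its methods are hypothetical, and CosmosException is a minimal stand-in.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Minimal stand-in for the COSMOS root exception (assumed shape).
class CosmosException extends Exception {
    CosmosException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical component used only for illustration.
class InventoryReader {
    private static final Logger LOG = Logger.getLogger(InventoryReader.class.getName());

    // Origin of the problem: create and throw, do not log here.
    int parseCount(String raw) throws CosmosException {
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new CosmosException("Invalid count: '" + raw + "'", e);
        }
    }

    // Receiving end: log in the catch block, then recover.
    int countOrDefault(String raw, int fallback) {
        // Logging can also happen where no exception is involved (tracing).
        LOG.fine("Parsing count");
        try {
            return parseCount(raw);
        } catch (CosmosException e) {
            LOG.log(Level.WARNING, "Falling back to default count", e);
            return fallback;
        }
    }
}

class LoggingDemo {
    public static void main(String[] args) {
        InventoryReader reader = new InventoryReader();
        System.out.println(reader.countOrDefault("12", 0));
        System.out.println(reader.countOrDefault("twelve", 0));
    }
}
```

Logging once at the point of recovery avoids the same failure appearing several times in the log as it propagates.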
Components:
- Domain
- Broker
- MDR/Data manager
Each component will keep its own log file. Tooling (TPTP) can be used to merge them when doing analysis.
Log format:
- Java logging
- Muse logging (?)
- any transaction ID to inject into log record for correlation
We'll require our own logging formatter to enable things like transaction ID injection. This argues (again) for implementing a COSMOS Log4j appender.
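The idea of injecting a transaction ID into each log record can be sketched with a custom formatter. This example deliberately uses java.util.logging.Formatter rather than Log4j so that it runs without external libraries; a Log4j appender or layout would play the same role. TransactionContext, TransactionFormatter and the output layout are assumptions made for this sketch.

```java
import java.util.logging.Formatter;
import java.util.logging.Level;
import java.util.logging.LogRecord;

// Hypothetical holder for the transaction ID of the request being processed on this thread.
class TransactionContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static void set(String transactionId) { CURRENT.set(transactionId); }
    static String get() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }
}

// Formatter that prefixes every record with the current transaction ID,
// so that entries from different component logs can be correlated later.
class TransactionFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
        String transactionId = TransactionContext.get();
        return String.format("%s [tx=%s] %s%n",
                record.getLevel(),
                transactionId == null ? "-" : transactionId,
                formatMessage(record));
    }
}

class FormatterDemo {
    public static void main(String[] args) {
        TransactionContext.set("tx-42");
        try {
            LogRecord record = new LogRecord(Level.INFO, "query received");
            System.out.print(new TransactionFormatter().format(record));
        } finally {
            // Clear the ID so it does not leak into unrelated work on this thread.
            TransactionContext.clear();
        }
    }
}
```

Because the ID is read when the record is formatted, the calling code does not have to pass it to every log statement.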
- Message catalogs
- Map Operational Status on the MDR
- Data Manager is available, but the application it wraps is not: what is the operational status of the Data Manager?
- log file registry & viewer
- error/warning event pub/sub
- CBE vs. WEF
- message formats & IDs (Valentina to help here)
Additional Topics to cover...
- stratify the conversation
- What do we do at the exact point of exception
- What do we do where we catch / recover the exception
- What is the interface between local exception handling and WSDM
- How do we interact with the logging system - it must accept these WEF events?
- How do we integrate with non-COSMOS exceptions?
- Which component will build these features?